Language-Driven Region Pointer Advancement for Controllable Image Captioning

Author: Robert Ross, John D. Kelleher, Annika Lindh
Contributors: John D. Kelleher and Robert J. Ross, ADAPT SFI Research Centre, SFI Research Centres Programme
Year of publication: 2020
Subject:
Computer Science - Computer Vision and Pattern Recognition (cs.CV)
Computer Science - Computation and Language (cs.CL)
Computer Science - Machine Learning (cs.LG)
Computer Science - Neural and Evolutionary Computing (cs.NE)
ACM classes: I.2.7, I.2.10, I.5.1
MSC classes: 68T07, 68T45, 68T50
controllable image captioning
natural language generation
computer vision
machine learning
deep learning
artificial intelligence
Source: Conference papers, COLING
Description: Controllable Image Captioning is a recent sub-field of the multi-modal task of Image Captioning wherein constraints are placed on which regions in an image should be described in the generated natural language caption. This puts a stronger focus on producing more detailed descriptions, and it opens the door for more end-user control over results. A vital component of the Controllable Image Captioning architecture is the mechanism that decides the timing of attending to each region through the advancement of a region pointer. In this paper, we propose a novel method for predicting the timing of region pointer advancement by treating the advancement step as a natural part of the language structure via a NEXT-token, motivated by a strong correlation with the sentence structure in the training data. We find that our timing agrees with the ground-truth timing in the Flickr30k Entities test data with a precision of 86.55% and a recall of 97.92%. Our model implementing this technique improves the state of the art on standard captioning metrics while additionally demonstrating a considerably larger effective vocabulary size.
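To make the NEXT-token idea concrete, below is a minimal decoding-loop sketch of the behaviour the description outlines: the region pointer advances only when the decoder emits a special NEXT token, so advancement timing is learned as part of the language sequence itself. All names here (generate_caption, decoder_step, the token strings) are illustrative assumptions for this sketch, not the paper's actual API.

```python
# Sketch of greedy decoding with a NEXT-token that advances the region
# pointer. Names and signatures are hypothetical, not from the paper.

NEXT = "<next>"  # special vocabulary token: move to the next region
EOS = "<eos>"    # end-of-sequence token
BOS = "<bos>"    # beginning-of-sequence token

def generate_caption(decoder_step, regions, max_len=50):
    """Greedy decoding over an ordered list of region features.

    decoder_step(region, prev_token) -> next predicted token (str).
    The region pointer only advances when the model emits NEXT, so the
    timing of attending to each region is part of the language structure.
    """
    pointer = 0   # index of the region currently being described
    tokens = []
    prev = BOS
    for _ in range(max_len):
        tok = decoder_step(regions[pointer], prev)
        if tok == EOS:
            break
        if tok == NEXT:
            # Advancement is just another token in the sequence.
            if pointer + 1 < len(regions):
                pointer += 1
            prev = tok
            continue
        tokens.append(tok)
        prev = tok
    return " ".join(tokens)
```

In practice, decoder_step would be a single step of the trained captioning decoder conditioned on the currently attended region; a trivial stand-in such as generate_caption(lambda region, prev: EOS, ["r1", "r2"]) simply returns an empty caption.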
Accepted to COLING 2020
Database: OpenAIRE