Language-Driven Region Pointer Advancement for Controllable Image Captioning
Author: | Robert Ross, John D. Kelleher, Annika Lindh |
Contributors: | John D. Kelleher and Robert J. Ross, ADAPT SFI Research Centre, SFI Research Centres Programme |
Year of publication: | 2020 |
Subject: | controllable image captioning; computer vision; deep learning; natural language generation; machine learning; Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); ACM classes I.2.7, I.2.10, I.5.1; MSC 68T07, 68T45, 68T50 |
Source: | Conference papers, COLING |
Description: | Controllable Image Captioning is a recent sub-field in the multi-modal task of Image Captioning wherein constraints are placed on which regions in an image should be described in the generated natural language caption. This puts a stronger focus on producing more detailed descriptions, and opens the door to more end-user control over results. A vital component of the Controllable Image Captioning architecture is the mechanism that decides the timing of attending to each region through the advancement of a region pointer. In this paper, we propose a novel method for predicting the timing of region pointer advancement by treating the advancement step as a natural part of the language structure via a NEXT-token, motivated by a strong correlation to the sentence structure in the training data. We find that our timing agrees with the ground-truth timing in the Flickr30k Entities test data with a precision of 86.55% and a recall of 97.92%. Our model implementing this technique improves the state-of-the-art on standard captioning metrics while additionally demonstrating a considerably larger effective vocabulary size. Accepted to COLING 2020. |
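The NEXT-token mechanism described in the abstract can be illustrated with a minimal sketch: during decoding, the region pointer advances whenever the model emits a special NEXT token, so attention shifts are predicted like any other vocabulary item. This is not the authors' code; the token string, function, and example data are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of aligning generated
# caption tokens to image regions via a special NEXT token.
NEXT = "<NEXT>"

def align_tokens_to_regions(tokens, regions):
    """Return (token, region) pairs for a generated token stream.

    The region pointer starts at the first region and advances by one
    each time the NEXT token appears, making the timing of attention
    shifts part of the language structure itself.
    """
    pointer = 0
    aligned = []
    for tok in tokens:
        if tok == NEXT:
            # Advancement is emitted like any other word; clamp at the
            # last region so extra NEXT tokens cannot overrun the list.
            pointer = min(pointer + 1, len(regions) - 1)
        else:
            aligned.append((tok, regions[pointer]))
    return aligned

# Hypothetical example: a caption grounded in two annotated regions.
tokens = ["a", "dog", NEXT, "chases", "a", "ball"]
regions = ["region_dog", "region_ball"]
print(align_tokens_to_regions(tokens, regions))
```

In the paper's setting the NEXT token is predicted by the captioning model itself; here the stream is given, which is enough to show how a single token stream encodes both the words and the pointer-advancement timing.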
Database: | OpenAIRE |
External link: |