Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze
Author: Takmaz, E., Pezzelle, S., Beinborn, L., Fernández, R.
Contributors: Language and Computation (ILLC, FNWI/FGw), ILLC (FNWI), Brain and Cognition, Faculty of Science, Language, Network Institute
Language: English
Year of publication: 2020
Subject: FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL); Computer Science - Computer Vision and Pattern Recognition (cs.CV); Closed captioning; Gaze; Visual processing; Modality (human–computer interaction); Experimental psychology; Computer vision; Artificial intelligence; Image processing
Source: Takmaz, E., Pezzelle, S., Beinborn, L. & Fernández, R. (2020). Generating Image Descriptions via Sequential Cross-Modal Alignment Guided by Human Gaze. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, pp. 4664–4677. https://doi.org/10.18653/v1/2020.emnlp-main.377
Description: When speakers describe an image, they tend to look at objects before mentioning them. In this paper, we investigate such sequential cross-modal alignment by modelling the image description generation process computationally. We take as our starting point a state-of-the-art image captioning system and develop several model variants that exploit information from human gaze patterns recorded during language production. In particular, we propose the first approach to image description generation where visual processing is modelled *sequentially*. Our experiments and analyses confirm that better descriptions can be obtained by exploiting gaze-driven attention and shed light on human cognitive processes by comparing different ways of aligning the gaze modality with language production. We find that processing gaze data sequentially leads to descriptions that are better aligned to those produced by speakers, more diverse, and more natural, particularly when gaze is encoded with a dedicated recurrent component. Published in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020). See the illustrative sketch after this record.
Database: OpenAIRE
External link:
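The description above mentions processing gaze data sequentially and encoding it with a dedicated recurrent component. Below is a minimal sketch of that general idea in PyTorch, assuming gaze is available as per-step fixation weights over image regions. This is not the authors' implementation: the class name `SequentialGazeEncoder`, the dimensions, and the matrix-product pooling of regions are assumptions made purely for illustration.

```python
# A minimal, illustrative sketch (not the paper's code) of encoding a sequence
# of gaze-weighted visual features with a recurrent component, which a caption
# decoder could then consume. All names, shapes, and the fusion scheme are
# hypothetical assumptions.
import torch
import torch.nn as nn

class SequentialGazeEncoder(nn.Module):
    """Encodes per-timestep gaze-weighted image features with a GRU."""
    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, region_feats: torch.Tensor, gaze_weights: torch.Tensor):
        # region_feats: (batch, regions, feat_dim) visual features per region
        # gaze_weights: (batch, steps, regions) fixation mass per region & step
        # Pool regions into one feature vector per gaze step ...
        gaze_feats = torch.bmm(gaze_weights, region_feats)  # (batch, steps, feat_dim)
        # ... then summarize the fixation sequence with the recurrent component.
        outputs, last_hidden = self.gru(gaze_feats)
        return outputs, last_hidden.squeeze(0)  # (batch, hidden_dim)

# Toy usage: 1 image, 36 regions, 5 gaze steps.
if __name__ == "__main__":
    encoder = SequentialGazeEncoder()
    feats = torch.randn(1, 36, 2048)
    gaze = torch.softmax(torch.randn(1, 5, 36), dim=-1)
    seq_out, summary = encoder(feats, gaze)
    print(seq_out.shape, summary.shape)  # (1, 5, 512) and (1, 512)
```

Under these assumptions, the per-step outputs could drive gaze-guided attention during decoding, while the final summary vector could initialize the decoder state; the paper compares several such alignment variants.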