Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Authors:	Michele Cafagna, Kees van Deemter, Albert Gatt
Language:	English
Publication year:	2022
Subject:	image captioning; Computer Science - Computation and Language; multimodal grounding; vision and language; Computer Science - Computer Vision and Pattern Recognition
Description:	Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the potential of a state-of-the-art Vision and Language model, VinVL, to caption images at the scene level using (1) a novel dataset which pairs images with both object-centric and scene descriptions. Through (2) an in-depth analysis of the effect of the fine-tuning, we show (3) that a small amount of curated data suffices to generate scene descriptions without losing the capability to identify object-level concepts in the scene; the model acquires a more holistic view of the image compared to when object-centric descriptions are generated. We discuss the parallels between these results and insights from computational and cognitive science research on scene perception.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::3abe94704c3e750a67ec2e2a6bb5daf2 http://arxiv.org/abs/2211.04971