Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval

Autor:	Ju, Yeong-Joon, Kim, Ho-Joong, Lee, Seong-Whan
Rok vydání:	2024
Předmět:	Computer Science - Computer Vision and Pattern Recognition Computer Science - Artificial Intelligence Computer Science - Information Retrieval Computer Science - Multimedia
Druh dokumentu:	Working Paper
Popis:	Existing multimodal retrieval systems often rely on disjointed models for image comprehension, such as object detectors and caption generators, leading to cumbersome implementations and training processes. To overcome this limitation, we propose an end-to-end retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries via dynamic modality interaction. Ret-XKnow leverages a partial convolution mechanism to focus on visual information relevant to the given textual query, thereby enhancing multimodal query representations. To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval (ViD2R) dataset automatically constructed from visual dialogue datasets. Our dataset construction process ensures that the dialogues are transformed into suitable information retrieval tasks using a text retriever. We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios. Our code is publicly available: https://github.com/yeongjoonJu/Ret_XKnow.
Databáze:	arXiv
Externí odkaz:	http://arxiv.org/abs/2411.08334 Zobrazit plný text záznamu View this record from Arxiv