Contrastive training of a multimodal encoder for medical visual question answering

Author:	João Daniel Silva, Bruno Martins, João Magalhães
Language:	English
Year of publication:	2023
Subject:	Medical visual question answering; Transformer encoders; Vision and language; Computer vision; Natural language processing; Cybernetics; Q300-390; Electronic computers. Computer science; QA75.5-76.95
Source:	Intelligent Systems with Applications, Vol 18, Iss , Pp 200221- (2023)
Document type:	article
ISSN:	2667-3053
DOI:	10.1016/j.iswa.2023.200221
Description:	Models for Visual Question Answering (VQA) on medical images aim to answer diagnostically relevant natural language questions based on visual content. In this article, we propose a novel approach to address this problem, which combines a strong image encoder based on EfficientNetV2 with a multimodal encoder based on the RealFormer architecture. Our model is pre-trained through a strategy that includes a contrastive objective, and the final fine-tuning to the VQA task uses a loss function that specifically addresses class imbalance. The experimental results confirm the effectiveness of our approach on the VQA-Med dataset from ImageCLEF 2019, showcasing the potential benefits of combining multimodal pre-training with recent advances in neural network architectures.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/d078de70a57049f0b3f5ec06439d04b1
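
Illustrative note: the description mentions a contrastive pre-training objective but does not specify its form. The sketch below shows one common choice, a symmetric InfoNCE-style loss over paired image and text embeddings, and is an assumption for illustration rather than the authors' formulation. The function name, the temperature value, and the use of NumPy are all hypothetical.

```python
import numpy as np


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    Row i of image_emb and row i of text_emb are treated as a matching pair;
    every other row in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i.
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))   # stand-in image embeddings
    texts = images.copy()              # perfectly aligned text embeddings
    print(symmetric_contrastive_loss(images, texts))
    print(symmetric_contrastive_loss(images, texts[::-1]))  # misaligned pairs
```

With aligned pairs the loss is small, and it grows when the pairing is scrambled, which is the behaviour a contrastive objective is meant to reward.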