Author:
Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal, Mansour Zuair
Language:
English
Year of publication:
2023
Source:
Bioengineering, Vol 10, Iss 3, p 380 (2023)
Document type:
article
ISSN:
2306-5354
DOI:
10.3390/bioengineering10030380
Description:
Medical images play a critical role in the clinical and healthcare domains. A mature medical visual question answering (VQA) system can improve diagnosis by answering clinical questions posed about a medical image. Despite its enormous potential for the healthcare industry and services, this technology is still in its infancy and far from practical use. This paper introduces an approach based on a transformer encoder–decoder architecture. Specifically, we extract image features using the vision transformer (ViT) model and embed the question using a textual encoder transformer. We then concatenate the resulting visual and textual representations and feed them into a multi-modal decoder that generates the answer autoregressively. In the experiments, we validate the proposed model on two medical VQA datasets, VQA-RAD (radiology images) and PathVQA (pathology images). The model shows promising results compared to existing solutions: it yields closed and open accuracies of 84.99% and 72.97%, respectively, on VQA-RAD, and 83.86% and 62.37%, respectively, on PathVQA. Other metrics, such as the BLEU score, which measures the alignment between the predicted and ground-truth answer sentences, are also reported.
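The pipeline described in the abstract can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class name VqaEncoderDecoder, the feature and model dimensions, the layer counts, and the vocabulary size are all placeholder assumptions, the ViT image encoder is stood in for by a linear projection of precomputed patch features, and positional embeddings are omitted for brevity. It only shows the two points the abstract emphasizes: visual and textual representations concatenated into one multi-modal memory, and answer generation under a causal mask, i.e. autoregressively.

import torch
import torch.nn as nn

class VqaEncoderDecoder(nn.Module):
    # Hypothetical sketch of the described pipeline. Placeholder sizes:
    # 768-dim ViT patch features, d_model=256, 2 layers per stack.
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        # Stand-in for the ViT encoder: project precomputed patch features.
        self.patch_proj = nn.Linear(768, d_model)
        # Textual encoder transformer for the question.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Multi-modal decoder that attends over the concatenated memory.
        self.answer_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, question_ids, answer_ids):
        img = self.patch_proj(patch_feats)                       # (B, n_patches, d)
        txt = self.text_encoder(self.text_embed(question_ids))  # (B, n_tokens, d)
        memory = torch.cat([img, txt], dim=1)    # concatenated multi-modal memory
        tgt = self.answer_embed(answer_ids)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal)  # autoregressive decoding
        return self.lm_head(out)                          # next-token logits

# Shape check with random inputs: 196 patches, a 12-token question,
# and an 8-token answer prefix yield logits of shape (2, 8, vocab_size).
model = VqaaEncoderDecoder() if False else VqaEncoderDecoder()
logits = model(torch.randn(2, 196, 768),
               torch.randint(0, 1000, (2, 12)),
               torch.randint(0, 1000, (2, 8)))

The forward pass above corresponds to teacher-forced training, where all next-token logits are produced in one shot; at inference time the decoder would instead be run step by step, appending each predicted token to the answer prefix before the next call.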
Database:
Directory of Open Access Journals