Multimodal feature fusion by relational reasoning and attention for visual question answering
Author: | Hua Hu, Zengchang Qin, Haiyang Hu, Jing Yu, Weifeng Zhang |
---|---|
Year of publication: | 2020 |
Subject: | Computer science; Bilinear interpolation; Networking & telecommunications; Engineering and technology; Image (mathematics); Discriminative model; Hardware and Architecture; Visual objects; Signal Processing; Electrical engineering, electronic engineering, information engineering; Question answering; Key (cryptography); Fuse (electrical); Artificial intelligence & image processing; Artificial intelligence; Natural language; Natural language processing; Software; Information Systems |
Source: | Information Fusion. 55:116-126 |
ISSN: | 1566-2535 |
DOI: | 10.1016/j.inffus.2019.08.009 |
Description: | Visual Question Answering (VQA) has recently become a hot topic in computer vision. A key to VQA lies in how to fuse the multimodal features extracted from the image and the question. In this paper, we show that combining visual relationships and attention achieves more fine-grained feature fusion. Specifically, we design an effective and efficient module to reason about complex relationships between visual objects. In addition, a bilinear attention module is learned for question-guided attention over visual objects, which yields more discriminative visual features. Given an image and a question in natural language, our VQA model learns a visual relational reasoning network and an attention network in parallel to fuse fine-grained textual and visual features, so that answers can be predicted accurately. Experimental results show that our approach achieves new state-of-the-art single-model performance on both the VQA 1.0 and VQA 2.0 datasets. |
Database: | OpenAIRE |
External link: |
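
The description above outlines a parallel architecture: a relational reasoning branch over detected object features and a question-guided bilinear attention branch, whose outputs are fused to predict the answer. The following PyTorch sketch illustrates that general structure under simplified assumptions; the module forms (pairwise MLP relational reasoning, low-rank bilinear attention), class names, and dimensions are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of the parallel relational-reasoning + bilinear-attention fusion
# described in the abstract. Module designs and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalReasoning(nn.Module):
    """Reason over all ordered pairs of object features, conditioned on the question."""

    def __init__(self, obj_dim, q_dim, hid_dim):
        super().__init__()
        self.g = nn.Sequential(
            nn.Linear(2 * obj_dim + q_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, hid_dim), nn.ReLU(),
        )

    def forward(self, objs, q):
        # objs: (B, N, obj_dim), q: (B, q_dim)
        B, N, D = objs.shape
        oi = objs.unsqueeze(2).expand(B, N, N, D)              # object i
        oj = objs.unsqueeze(1).expand(B, N, N, D)              # object j
        qq = q.unsqueeze(1).unsqueeze(2).expand(B, N, N, q.size(-1))
        pair = torch.cat([oi, oj, qq], dim=-1)                 # all ordered pairs
        return self.g(pair).sum(dim=(1, 2))                    # (B, hid_dim)


class BilinearAttention(nn.Module):
    """Low-rank bilinear, question-guided attention over object features."""

    def __init__(self, obj_dim, q_dim, rank):
        super().__init__()
        self.U = nn.Linear(obj_dim, rank)
        self.V = nn.Linear(q_dim, rank)
        self.w = nn.Linear(rank, 1)

    def forward(self, objs, q):
        # objs: (B, N, obj_dim), q: (B, q_dim)
        joint = self.U(objs) * self.V(q).unsqueeze(1)          # (B, N, rank)
        alpha = F.softmax(self.w(torch.tanh(joint)), dim=1)    # attention weights
        return (alpha * objs).sum(dim=1)                       # attended visual feature


class VQAFusionModel(nn.Module):
    """Run both branches in parallel, fuse their outputs, and classify answers."""

    def __init__(self, obj_dim=2048, q_dim=1024, hid_dim=512, rank=256, n_answers=3000):
        super().__init__()
        self.relation = RelationalReasoning(obj_dim, q_dim, hid_dim)
        self.attention = BilinearAttention(obj_dim, q_dim, rank)
        self.attn_proj = nn.Linear(obj_dim, hid_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hid_dim + q_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, n_answers),
        )

    def forward(self, objs, q):
        rel = self.relation(objs, q)                           # relational branch
        att = self.attn_proj(self.attention(objs, q))          # attention branch
        fused = torch.cat([rel, att, q], dim=-1)               # fuse with question
        return self.classifier(fused)                          # answer logits


if __name__ == "__main__":
    model = VQAFusionModel()
    objs = torch.randn(2, 36, 2048)    # e.g. 36 detected object features per image
    q = torch.randn(2, 1024)           # encoded question vector
    print(model(objs, q).shape)        # torch.Size([2, 3000])
```

In this sketch the two branches consume the same object features and are only combined at the answer classifier, mirroring the "in parallel" phrasing of the abstract; the fusion operator and attention formulation used in the paper itself may differ.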