Component Analysis for Visual Question Answering Architectures

Authors: Camila Kolling, Jonatas Wehrmann, Rodrigo C. Barros
Publication year: 2020
Subject:
FOS: Computer and information sciences
Computer Science - Machine Learning
Computer science
Computer Vision and Pattern Recognition (cs.CV)
Feature extraction
Computer Science - Computer Vision and Pattern Recognition
02 engineering and technology
010501 environmental sciences
computer.software_genre
01 natural sciences
Machine Learning (cs.LG)
Knowledge extraction
Component (UML)
0202 electrical engineering, electronic engineering, information engineering
Question answering
Set (psychology)
0105 earth and related environmental sciences
Computer Science - Computation and Language
business.industry
Visualization
Task analysis
020201 artificial intelligence & image processing
Artificial intelligence
business
Computation and Language (cs.CL)
computer
Feature learning
Natural language processing
Natural language
Source: IJCNN
DOI: 10.1109/ijcnn48605.2020.9206679
Description: Recent research advances in Computer Vision and Natural Language Processing have introduced novel tasks that are paving the way for solving AI-complete problems. One of those tasks is Visual Question Answering (VQA). A VQA system must take an image and a free-form, open-ended natural language question about the image, and produce a natural language answer as output. The task has drawn great attention from the scientific community, which has generated a plethora of approaches that aim to improve VQA predictive accuracy. Most of them comprise three major components: (i) independent representation learning of images and questions; (ii) feature fusion, so that the model can use information from both sources to answer visual questions; and (iii) generation of the correct answer in natural language. With so many approaches recently introduced, the real contribution of each component to the ultimate performance of the model has become unclear. The main goal of this paper is to provide a comprehensive analysis of the impact of each component in VQA models. Our extensive set of experiments covers both visual and textual elements, as well as the combination of these representations in the form of fusion and attention mechanisms. Our major contribution is to identify the core components for training VQA models so as to maximize their predictive performance.
Database: OpenAIRE
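
Note: the three components named in the description (representation learning, feature fusion, answer generation) can be illustrated with a toy pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the feature size `DIM`, the answer vocabulary `ANSWERS`, the pseudo-embeddings, the element-wise product fusion, and the choice to treat answering as classification over a fixed vocabulary are all hypothetical stand-ins chosen for illustration. The weights are random and untrained, so the predicted answer carries no meaning.

```python
import random

ANSWERS = ["yes", "no", "red", "two"]  # hypothetical answer vocabulary
DIM = 8                                # hypothetical shared feature size


def _random_vector(rng, dim):
    """Draw a vector of `dim` values in [-1, 1] from the given generator."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_question(question, dim=DIM):
    """(i) Question representation: average of per-word pseudo-embeddings.

    Stand-in for a learned text encoder. Each word seeds its own generator,
    so the same word always maps to the same vector.
    """
    words = question.lower().split()
    vec = [0.0] * dim
    for word in words:
        emb = _random_vector(random.Random(word), dim)
        vec = [v + e for v, e in zip(vec, emb)]
    return [v / max(len(words), 1) for v in vec]


def encode_image(pixels, dim=DIM):
    """(i) Image representation: mean intensity of `dim` consecutive chunks.

    Stand-in for a learned visual encoder such as a CNN.
    """
    chunk = max(len(pixels) // dim, 1)
    feats = []
    for i in range(dim):
        part = pixels[i * chunk:(i + 1) * chunk] or [0.0]
        feats.append(sum(part) / len(part))
    return feats


def fuse(img_vec, txt_vec):
    """(ii) Fusion: element-wise product of the two feature vectors."""
    return [a * b for a, b in zip(img_vec, txt_vec)]


def predict_answer(fused, answers=ANSWERS):
    """(iii) Answer generation, simplified to scoring a fixed vocabulary.

    Uses fixed-seed random weights (untrained) and returns the top-scoring answer.
    """
    rng = random.Random(0)
    weights = [_random_vector(rng, len(fused)) for _ in answers]
    scores = [sum(w * f for w, f in zip(row, fused)) for row in weights]
    return answers[scores.index(max(scores))]


if __name__ == "__main__":
    image = [i / 63 for i in range(64)]  # fake 64-pixel grayscale image
    fused = fuse(encode_image(image), encode_question("What color is the car?"))
    print(predict_answer(fused))
```

Keeping the three stages as separate functions mirrors the kind of component-wise comparison the description refers to: any one stage (for example, the fusion function) can be swapped out while the other two stay fixed.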