Just Ask: Learning to Answer Questions from Millions of Narrated Videos

Author: Yang, Antoine; Miech, Antoine; Sivic, Josef; Laptev, Ivan; Schmid, Cordelia
Contributors: Models of visual object recognition and scene understanding (WILLOW); Département d'informatique - ENS Paris (DI-ENS); Département d'informatique de l'École normale supérieure (DI-ENS); École normale supérieure - Paris (ENS-PSL); Université Paris sciences et lettres (PSL); Institut National de Recherche en Informatique et en Automatique (Inria) - Inria de Paris; Centre National de la Recherche Scientifique (CNRS); DeepMind [London], DeepMind Technologies; Czech Institute of Informatics, Robotics and Cybernetics [Prague] (CIIRC); Czech Technical University in Prague (CTU). Funding: ANR-19-P3IA-0001, PRAIRIE, PaRis Artificial Intelligence Research InstitutE (2019). This work was granted access to the HPC resources of IDRIS under the allocation 2020-101267 made by GENCI. The work was funded by a Google gift, the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), the Louis Vuitton ENS Chair on Artificial Intelligence, the European Regional Development Fund under project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15_003/0000468), and Antoine Miech's Google PhD fellowship.
Year of publication: 2021
Subject:
Source: ICCV 2021 - IEEE International Conference on Computer Vision, Oct 2021, Montréal, Canada (held virtually)
Description: Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and to generate a large-scale training dataset for video question answering by making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html.
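The training procedure described above scores video-question embeddings against answer embeddings with a contrastive loss. As an illustration only (not the authors' released code; the function name, embedding dimensions, and the use of in-batch negatives are assumptions), a minimal PyTorch-style sketch of such a contrastive objective might look like this:

```python
# Hypothetical sketch of a contrastive loss between pooled video-question
# embeddings and answer embeddings, using other answers in the batch as negatives.
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor) -> torch.Tensor:
    """vq_emb: (B, D) pooled outputs of a video-question multi-modal transformer.
    ans_emb: (B, D) outputs of an answer transformer; answer i matches pair i."""
    scores = vq_emb @ ans_emb.t()                       # (B, B) similarity matrix
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)             # push matching pairs above negatives

# Usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    B, D = 8, 512
    loss = contrastive_vqa_loss(torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```

At inference, the same similarity score can rank a fixed answer vocabulary for a given video-question pair, which is what enables zero-shot evaluation without task-specific annotation.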
Accepted at ICCV 2021 (Oral); 20 pages; 14 figures
Database: OpenAIRE