HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Autor:	Jean-Baptiste Alayrac, Josef Sivic, Antoine Miech, Ivan Laptev, Dimitri Zhukov, Makarand Tapaswi
Přispěvatelé:	Models of visual object recognition and scene understanding (WILLOW), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria), Laboratoire d'informatique de l'école normale supérieure (LIENS), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Czech Institute of Informatics, Robotics and Cybernetics [Prague] (CIIRC), Czech Technical University in Prague (CTU), The project was partially supported by Antoine Miech Google PhD fellowship, the MSR-Inria joint lab, the Louis Vuitton - ENS Chair on Artificial Intelligence, the ERC grant LEAP (No. 336845), the CIFAR Learning in Machines & Brains program, and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15003/0000468)., Miech, Antoine, Inria de Paris, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Département d'informatique - ENS Paris (DI-ENS), Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Département d'informatique de l'École normale supérieure (DI-ENS), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)
Jazyk:	angličtina
Rok vydání:	2019
Předmět:	FOS: Computer and information sciences Information retrieval Computer science business.industry Computer Vision and Pattern Recognition (cs.CV) Computer Science - Computer Vision and Pattern Recognition [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV] 02 engineering and technology 010501 environmental sciences 01 natural sciences Visualization [INFO.INFO-CV] Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV] 0202 electrical engineering electronic engineering information engineering Task analysis Code (cryptography) Embedding 020201 artificial intelligence & image processing Artificial intelligence CLIPS business computer Natural language 0105 earth and related environmental sciences computer.programming_language
Zdroj:	ICCV 2019-International Conference on Computer Vision ICCV 2019-International Conference on Computer Vision, Oct 2019, Séoul, South Korea ICCV
Popis:	Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic Youtube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/. Accepted at ICCV 2019
Databáze:	OpenAIRE
Externí odkaz:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::720f44a13124db3df7f8d3e8279c7bc2 https://hal.science/hal-02433497 Zobrazit plný text záznamu