VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding
Author: Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh, Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
Language: English
Year of publication: 2021
Subject: masking; modalities; multi-task learning; language model; encoder; speech recognition; Computer Science - Computation and Language (cs.CL); Computer Science - Computer Vision and Pattern Recognition (cs.CV); FOS: Computer and information sciences
Source: ACL/IJCNLP (Findings)
Description: We present a simplified, task-agnostic multi-modal pre-training approach that can accept either video or text input, or both, for a variety of end tasks. Existing pre-training approaches are task-specific: they adopt either a single cross-modal encoder that requires both modalities, limiting their use for retrieval-style end tasks, or more complex multitask learning with two unimodal encoders, limiting early cross-modal fusion. We instead introduce new pre-training masking schemes that better mix across modalities (e.g., by forcing masks for text to predict the closest video embeddings) while also maintaining separability (e.g., unimodal predictions are sometimes required, without using all the input). Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training. Code is made available at https://github.com/pytorch/fairseq/tree/main/examples/MMPT. 9 pages, ACL Findings 2021.
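To make the masking schemes in the description concrete, below is a minimal PyTorch sketch of cross-modal masking: with some probability an entire modality is blanked out, so the shared encoder must reconstruct it from the other modality alone; otherwise random positions are masked within each modality. The function `mask_inputs`, the constant `MASK_TOKEN_ID`, and the probabilities are illustrative assumptions for this sketch, not the released MMPT code.

```python
import torch

MASK_TOKEN_ID = 103              # hypothetical [MASK] id; the real id depends on the tokenizer
P_TOKEN, P_MODALITY = 0.15, 0.5  # assumed masking probabilities

def mask_inputs(video_feats, text_ids):
    """Return masked copies of both modalities plus boolean masks over
    the positions the encoder is asked to reconstruct."""
    video, text = video_feats.clone(), text_ids.clone()
    if torch.rand(()).item() < P_MODALITY:
        # Masked-modality case: blank one whole modality so the shared
        # encoder must predict it from the other modality alone.
        mask_video = torch.rand(()).item() < 0.5
        video_mask = torch.full((video.shape[0],), mask_video, dtype=torch.bool)
        text_mask = torch.full((text.shape[0],), not mask_video, dtype=torch.bool)
    else:
        # Masked-token/frame case: mask random positions in each modality,
        # keeping both unimodal and cross-modal context partially visible.
        video_mask = torch.rand(video.shape[0]) < P_TOKEN
        text_mask = torch.rand(text.shape[0]) < P_TOKEN
    video[video_mask] = 0.0            # zero out masked frame features
    text[text_mask] = MASK_TOKEN_ID    # replace masked tokens with [MASK]
    return video, text, video_mask, text_mask

# Usage: 16 frame features of dimension 512 and a 12-token caption.
v, t, vm, tm = mask_inputs(torch.randn(16, 512), torch.randint(5, 1000, (12,)))
```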
Database: OpenAIRE
External link: