Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training
| Author: | Weituo Hao, Xiujun Li, Lawrence Carin, Jianfeng Gao, Chunyuan Li |
|---|---|
| Year: | 2020 |
| Subject: | FOS: Computer and information sciences; Computer Science - Computer Vision and Pattern Recognition (cs.CV); Computer Science - Machine Learning (cs.LG); Computer Science - Computation and Language (cs.CL); Computer Science - Robotics (cs.RO); artificial intelligence; human–computer interaction; representation (mathematics); visualization; variable (computer science); task analysis; trajectory; benchmark (computing); state (computer science) |
| Source: | CVPR |
| DOI: | 10.1109/cvpr42600.2020.01315 |
| Description: | Learning to navigate in a visual environment by following natural-language instructions is a challenging task, because the multimodal inputs to the agent are highly variable and the training data for a new task is often limited. In this paper, we present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks. By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions. It can easily be used as a drop-in module for existing VLN frameworks, leading to the proposed agent, called Prevalent. It learns more effectively on new tasks and generalizes better to previously unseen environments. The performance is validated on three VLN tasks. On the Room-to-Room benchmark, our model improves the state of the art from 47% to 51% in success rate weighted by path length. Further, the learned representation is transferable to other VLN tasks. On two recent tasks, vision-and-dialog navigation and "Help, Anna!", the proposed Prevalent leads to significant improvements over existing methods, achieving a new state of the art. Comment: To appear at CVPR 2020. The first two authors contributed equally to this manuscript. Code: https://github.com/weituo12321/PREVALENT (A minimal sketch of the pre-train/fine-tune paradigm is given below the record.) |
| Database: | OpenAIRE |
| External link: | |
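
The description outlines a two-phase paradigm: self-supervised pre-training on image-text-action triplets, then reuse of the pre-trained encoder inside a navigation agent for fine-tuning. The sketch below illustrates that flow in PyTorch; the module name `TripletEncoder`, all layer sizes, and the two loss heads are illustrative assumptions, not the released PREVALENT implementation (see the linked repository for the authors' code).

```python
# Hedged sketch of a pre-train/fine-tune pipeline for VLN.
# All names and sizes are hypothetical; this is not the PREVALENT code.
import torch
import torch.nn as nn

class TripletEncoder(nn.Module):
    """Encodes an (instruction, image-views) pair into a joint representation."""
    def __init__(self, vocab_size=1000, img_dim=2048, d_model=256, n_actions=6):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Self-supervised heads: word prediction and next-action prediction.
        self.word_head = nn.Linear(d_model, vocab_size)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, tokens, img_feats):
        # Concatenate language and vision tokens into one sequence.
        x = torch.cat([self.word_emb(tokens), self.img_proj(img_feats)], dim=1)
        return self.encoder(x)

# --- Phase 1: self-supervised pre-training on image-text-action triplets ---
model = TripletEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
tokens = torch.randint(0, 1000, (8, 20))   # placeholder instruction token ids
imgs = torch.randn(8, 4, 2048)             # placeholder panoramic view features
actions = torch.randint(0, 6, (8,))        # placeholder next-action labels
h = model(tokens, imgs)
# Word prediction over the language positions (input masking omitted for brevity).
word_loss = nn.functional.cross_entropy(
    model.word_head(h[:, :20]).reshape(-1, 1000), tokens.reshape(-1))
# Action prediction from the first position's representation.
action_loss = nn.functional.cross_entropy(model.action_head(h[:, 0]), actions)
(word_loss + action_loss).backward()
opt.step()

# --- Phase 2: drop the pre-trained encoder into a VLN agent and fine-tune ---
class Agent(nn.Module):
    def __init__(self, encoder, d_model=256, n_actions=6):
        super().__init__()
        self.encoder = encoder               # reused pre-trained weights
        self.policy = nn.Linear(d_model, n_actions)

    def forward(self, tokens, img_feats):
        return self.policy(self.encoder(tokens, img_feats)[:, 0])

agent = Agent(model)                         # fine-tune on the downstream task
logits = agent(tokens, imgs)                 # action logits, shape (8, 6)
```

The design choice this sketch tries to convey is the drop-in property claimed in the abstract: the encoder is trained once on triplets and then transplanted unchanged into an agent, so only the small policy head is task-specific.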