Showing 1 - 10 of 20 results for the search: '"Narasimhan, Medhini"'
Author:
Subramanian, Sanjay, Narasimhan, Medhini, Khangaonkar, Kushal, Yang, Kevin, Nagrani, Arsha, Schmid, Cordelia, Zeng, Andy, Darrell, Trevor, Klein, Dan
We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models (LMs), visual…
External link:
http://arxiv.org/abs/2306.05392
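The abstract above describes answering visual questions by having a language model write a short program over visual primitives, with no additional training. The sketch below illustrates that idea under stated assumptions: the primitive `simple_query` and the helper `lm_generate` are hypothetical placeholders, not the authors' actual interface.

```python
# A minimal sketch of VQA as modular code generation, assuming a code-writing LM
# and a single hypothetical visual primitive. The names `simple_query` and
# `lm_generate` are illustrative placeholders, not the paper's actual API.

def simple_query(image, question: str) -> str:
    """Placeholder for a pre-trained visual model answering a simple sub-question."""
    raise NotImplementedError

def lm_generate(prompt: str) -> str:
    """Placeholder for a pre-trained language model that emits Python source."""
    raise NotImplementedError

PROMPT = (
    "# Write a complete Python function `answer(image)` that answers the question\n"
    "# by composing calls to simple_query(image, sub_question).\n"
    "# Question: {question}\n"
)

def modular_vqa(image, question: str) -> str:
    program = lm_generate(PROMPT.format(question=question))  # LM decomposes the question into code
    scope = {"simple_query": simple_query}
    exec(program, scope)            # run the generated program; no extra training involved
    return scope["answer"](image)
```

Executing LM-generated code is a design choice with obvious safety implications; a real system would sandbox the `exec` call.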
Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantic…
External link:
http://arxiv.org/abs/2303.13519
Author:
Narasimhan, Medhini, Nagrani, Arsha, Sun, Chen, Rubinstein, Michael, Darrell, Trevor, Rohrbach, Anna, Schmid, Cordelia
YouTube users looking for instructions for a specific task may spend a long time browsing content trying to find the right video that matches their needs. Creating a visual summary (abridged version of a video) provides viewers with a quick overview…
External link:
http://arxiv.org/abs/2208.06773
We propose a novel framework for multi-person 3D motion trajectory prediction. Our key observation is that a human's action and behaviors may highly depend on the other persons around. Thus, instead of predicting each human pose trajectory in isolation…
External link:
http://arxiv.org/abs/2111.12073
Published in:
Thirty-Fifth Conference on Neural Information Processing Systems. 2021
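The trajectory-prediction record above hinges on modeling each person jointly with the people around them rather than in isolation. The sketch below shows one way to realize that with cross-person attention; the layer choices, dimensions, and class name are assumptions made for illustration, not the paper's architecture.

```python
# A minimal sketch of joint multi-person trajectory prediction: each person's
# future is decoded while attending to the other people in the scene, instead of
# being predicted in isolation. All sizes and layer choices are illustrative.
import torch
import torch.nn as nn

class SocialTrajectoryPredictor(nn.Module):
    def __init__(self, n_joints=17, t_in=10, t_out=25, d_model=128):
        super().__init__()
        self.encode = nn.Linear(t_in * n_joints * 3, d_model)      # per-person history -> token
        self.interact = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decode = nn.Linear(d_model, t_out * n_joints * 3)     # token -> future poses
        self.t_out, self.n_joints = t_out, n_joints

    def forward(self, past):                     # past: (batch, persons, t_in, n_joints, 3)
        b, p = past.shape[:2]
        tokens = self.encode(past.flatten(2))    # (batch, persons, d_model)
        social, _ = self.interact(tokens, tokens, tokens)   # each person attends to all others
        future = self.decode(tokens + social)    # residual: own history plus social context
        return future.view(b, p, self.t_out, self.n_joints, 3)

# Usage: predict 25 future frames for 3 people from 10 observed frames.
model = SocialTrajectoryPredictor()
pred = model(torch.randn(2, 3, 10, 17, 3))       # -> (2, 3, 25, 17, 3)
```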
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by…
External link:
http://arxiv.org/abs/2107.00650
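The abstract above argues that scene importance is subjective and that summaries should be customizable by the user. A minimal sketch of one such query-conditioned selection scheme follows; the encoders and the linear mixing of scores are assumptions, not the paper's method.

```python
# A minimal sketch of user-customizable summarization: each shot gets a generic
# importance score plus a relevance score against the user's text query, and the
# summary keeps the top-scoring shots in temporal order. The embedding functions
# are hypothetical placeholders.
import numpy as np

def embed_text(query: str) -> np.ndarray:
    """Placeholder for a pre-trained text encoder."""
    raise NotImplementedError

def embed_shot(shot) -> np.ndarray:
    """Placeholder for a pre-trained video-shot encoder."""
    raise NotImplementedError

def summarize(shots, generic_scores, query, budget=5, alpha=0.5):
    q = embed_text(query)
    q = q / np.linalg.norm(q)
    shot_embs = np.stack([embed_shot(s) for s in shots])
    shot_embs = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    relevance = shot_embs @ q                               # cosine similarity to the user query
    scores = alpha * np.asarray(generic_scores) + (1 - alpha) * relevance
    keep = sorted(np.argsort(scores)[-budget:])             # top shots, restored to temporal order
    return [shots[i] for i in keep]
```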
We introduce a non-parametric approach for infinite video texture synthesis using a representation learned via contrastive learning. We take inspiration from Video Textures, which showed that plausible new videos could be generated from a single one…
External link:
http://arxiv.org/abs/2104.02687
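The video-texture record above describes a non-parametric scheme: new video is produced by re-ordering frames of the input, with transitions chosen in a contrastively learned embedding space. The sketch below is a drastically simplified version of that idea (the actual transition rule is more involved); `encode_frames` is a hypothetical placeholder.

```python
# A simplified non-parametric texture loop: play frames forward, and when the
# clip ends, jump to the frame following the closest match to the current frame
# in the learned embedding space, so playback can continue indefinitely.
import numpy as np

def encode_frames(frames) -> np.ndarray:
    """Placeholder for a contrastively learned encoder: frames -> (T, D) embeddings."""
    raise NotImplementedError

def synthesize_texture(frames, n_out=1000):
    emb = encode_frames(frames)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                          # cosine similarity between all frame pairs
    t, out = 0, []
    for _ in range(n_out):
        out.append(frames[t])
        if t + 1 < len(frames):
            t += 1                             # normally just play the next frame
        else:
            scores = sim[t].copy()
            scores[t] = -np.inf                # exclude staying on the same frame
            t = min(int(np.argmax(scores)) + 1, len(frames) - 1)  # jump past the best match
    return out
```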
Author:
Narasimhan, Medhini, Wijmans, Erik, Chen, Xinlei, Darrell, Trevor, Batra, Dhruv, Parikh, Devi, Singh, Amanpreet
We introduce a learning-based approach for room navigation using semantic maps. Our proposed architecture learns to predict top-down belief maps of regions that lie beyond the agent's field of view while modeling architectural and stylistic regularities…
External link:
http://arxiv.org/abs/2007.09841
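The room-navigation abstract describes predicting top-down belief maps over regions the agent has not yet seen, conditioned on the semantic map observed so far and the target room. Below is a small illustrative sketch of such a belief-map predictor; the convolutional architecture, channel counts, and room encoding are assumptions rather than the paper's model.

```python
# A minimal sketch of a belief-map predictor: given the partial top-down semantic
# map and a target room category, predict a per-cell belief that the target room
# lies there, including cells beyond the agent's current field of view.
import torch
import torch.nn as nn

class RoomBeliefNet(nn.Module):
    def __init__(self, n_semantic=16, n_rooms=9):
        super().__init__()
        self.n_rooms = n_rooms
        self.net = nn.Sequential(
            nn.Conv2d(n_semantic + n_rooms, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, semantic_map, room_id):
        # semantic_map: (B, n_semantic, H, W); room_id: (B,) target room index
        b, _, h, w = semantic_map.shape
        room = torch.zeros(b, self.n_rooms, h, w, device=semantic_map.device)
        room[torch.arange(b), room_id] = 1.0          # broadcast target room as an extra channel
        logits = self.net(torch.cat([semantic_map, room], dim=1))
        return torch.sigmoid(logits)                  # (B, 1, H, W) belief map

# Usage: take the most likely cell as the next long-range waypoint.
belief = RoomBeliefNet()(torch.rand(1, 16, 64, 64), torch.tensor([3]))
goal_rc = torch.nonzero(belief[0, 0] == belief[0, 0].max())[0]   # (row, col)
```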
Accurately answering a question about a given image requires combining observations with general knowledge. While this is effortless for humans, reasoning with general knowledge remains an algorithmic challenge. To advance research in this direction…
External link:
http://arxiv.org/abs/1811.00538
Question answering is an important task for autonomous agents and virtual assistants alike and was shown to support the disabled in efficiently navigating an overwhelming environment. Many existing methods focus on observation-based questions, ignoring…
External link:
http://arxiv.org/abs/1809.01124
Academic article (sign-in required to view this result).