Showing 1 - 10 of 239 for search: '"Agrawal, Harsh"'
Author:
Li, Zhangheng, You, Keen, Zhang, Haotian, Feng, Di, Agrawal, Harsh, Li, Xiujun, Moorthy, Mohana Prasad Sathya, Nichols, Jeff, Yang, Yinfei, Gan, Zhe
Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model …
External link:
http://arxiv.org/abs/2410.18967
Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with …
External link:
http://arxiv.org/abs/2406.07904
Author:
Szot, Andrew, Schwarzer, Max, Agrawal, Harsh, Mazoure, Bogdan, Talbott, Walter, Metcalf, Katherine, Mackraz, Natalie, Hjelm, Devon, Toshev, Alexander
We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks. Our approach, called Large LAnguage model Reinforcement Learning Policy (LLaRP), adapts a pre-trained frozen LLM to take as input text …
External link:
http://arxiv.org/abs/2310.17722
Author:
Kant, Yash, Ramachandran, Arun, Yenamandra, Sriram, Gilitschenski, Igor, Batra, Dhruv, Szot, Andrew, Agrawal, Harsh
We introduce Housekeep, a benchmark to evaluate commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged …
External link:
http://arxiv.org/abs/2205.10712
Author:
Koh, Jing Yu, Agrawal, Harsh, Batra, Dhruv, Tucker, Richard, Waters, Austin, Lee, Honglak, Yang, Yinfei, Baldridge, Jason, Anderson, Peter
We study the problem of synthesizing immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints, including viewpoints that extrapolate far beyond the input images while maintaining …
External link:
http://arxiv.org/abs/2204.02960
Published in:
Experimental Hematology, August 2024, 136
Natural language instructions for visual navigation often use scene descriptions (e.g., "bedroom") and object references (e.g., "green chairs") to provide a breadcrumb trail to a goal location. This work presents a transformer-based vision-and-language …
External link:
http://arxiv.org/abs/2110.14143
It is fundamental for personal robots to reliably navigate to a specified goal. To study this task, PointGoal navigation has been introduced in simulated Embodied AI environments. Recent advances solve this PointGoal navigation task with near-perfect …
External link:
http://arxiv.org/abs/2108.11550
Recent Visual Question Answering (VQA) models have shown impressive performance on the VQA benchmark but remain sensitive to small linguistic variations in input questions. Existing approaches address this by augmenting the dataset with question paraphrases …
External link:
http://arxiv.org/abs/2010.06087
Author:
Kant, Yash, Batra, Dhruv, Anderson, Peter, Schwing, Alex, Parikh, Devi, Lu, Jiasen, Agrawal, Harsh
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited …
External link:
http://arxiv.org/abs/2007.12146