Showing 1 - 10 of 23 for search: '"Shi, Baifeng"'
Author:
Niu, Dantong, Sharma, Yuvan, Biamby, Giscard, Quenum, Jerome, Bai, Yutong, Shi, Baifeng, Darrell, Trevor, Herzig, Roei
In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robot…
External link:
http://arxiv.org/abs/2406.11815
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S²)… (see the sketch below)
External link:
http://arxiv.org/abs/2403.13043
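The S² idea above replaces a bigger backbone with a smaller one run at several image scales. Below is a minimal PyTorch sketch of that idea; it resizes the whole image per scale (the paper tiles large images into sub-images instead), and `backbone`, its (B, C, H, W) output layout, and the scale set are illustrative assumptions, not the paper's interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multiscale_features(backbone, image, scales=(1.0, 2.0)):
    """Run `backbone` at several image scales; fuse features channel-wise."""
    feats, base = [], None
    for s in scales:
        x = image if s == 1.0 else F.interpolate(
            image, scale_factor=s, mode="bilinear", align_corners=False)
        f = backbone(x)                                  # (B, C, h_s, w_s)
        if base is None:
            base = f.shape[-2:]                          # spatial size at base scale
        feats.append(F.interpolate(f, size=base, mode="bilinear",
                                   align_corners=False))
    return torch.cat(feats, dim=1)                       # (B, C * len(scales), h, w)

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
img = torch.randn(1, 3, 224, 224)
print(multiscale_features(backbone, img).shape)          # torch.Size([1, 32, 112, 112])
```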
Author:
Radosavovic, Ilija, Zhang, Bike, Shi, Baifeng, Rajasegaran, Jathushan, Kamat, Sarthak, Darrell, Trevor, Sreenath, Koushil, Malik, Jitendra
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal… (see the sketch below)
External link:
http://arxiv.org/abs/2402.19469
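A hedged sketch of the next-token formulation in the entry above: a causal transformer regresses the next sensorimotor token from the past ones. Flattening each timestep into a single continuous token, the model class, and all dimensions are simplifying assumptions; the paper's handling of multiple modalities is not reproduced here.

```python
import torch
import torch.nn as nn

class NextTokenPolicy(nn.Module):
    """Causal transformer over continuous sensorimotor tokens (toy version)."""
    def __init__(self, token_dim, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, token_dim)

    def forward(self, tokens):                            # (B, T, token_dim)
        T = tokens.size(1)
        h = self.embed(tokens) + self.pos[:, :T]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.blocks(h, mask=causal))     # position t predicts token t+1

model = NextTokenPolicy(token_dim=32)
traj = torch.randn(4, 50, 32)                             # batch of sensorimotor trajectories
pred = model(traj)
loss = (pred[:, :-1] - traj[:, 1:]).pow(2).mean()         # autoregressive regression loss
```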
Author:
Fu, Letian, Lian, Long, Wang, Renhao, Shi, Baifeng, Wang, Xudong, Yala, Adam, Darrell, Trevor, Efros, Alexei A., Goldberg, Ken
In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest… (see the sketch below)
External link:
http://arxiv.org/abs/2401.14391
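The decomposition mentioned above contrasts a standard MAE decoder (self-attention over visible and mask tokens together) with pure cross-attention, where each mask query attends only to the encoded visible patches. A minimal sketch of the cross-attention path, with made-up token counts (196 patches, 75% masked):

```python
import torch
import torch.nn as nn

d = 64
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

visible = torch.randn(2, 49, d)     # encoder outputs for the visible patches
queries = torch.randn(2, 147, d)    # mask tokens (+ positional info) for masked patches

# Each masked patch is decoded from the visible tokens alone:
# no self-attention among the mask queries themselves.
recon, attn = cross_attn(query=queries, key=visible, value=visible)
print(recon.shape)                  # torch.Size([2, 147, 64])
```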
Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in… (see the sketch below)
External link:
http://arxiv.org/abs/2312.02249
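To make the generate-and-execute loop above concrete, here is a toy illustration: a code-generating model emits a short program against a small vision API, and executing it yields the answer. The `detect`/`count` primitives and the dict-as-image stub are hypothetical stand-ins, not any real VP system's interface.

```python
def detect(image, name):
    """Hypothetical detector: boxes for `name` in `image` (stubbed)."""
    return image.get(name, [])

def count(boxes):
    return len(boxes)

# A program a model might generate for "How many cats are in the image?"
generated_program = """
boxes = detect(image, "cat")
answer = count(boxes)
"""

image = {"cat": [(0, 0, 10, 10), (20, 20, 30, 30)]}   # stub image with two cats
scope = {"detect": detect, "count": count, "image": image}
exec(generated_program, scope)
print(scope["answer"])                                 # 2
```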
Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations…
External link:
http://arxiv.org/abs/2309.17444
Author:
Radosavovic, Ilija, Shi, Baifeng, Fu, Letian, Goldberg, Ken, Darrell, Trevor, Malik, Jitendra
We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode… (see the sketch below)
External link:
http://arxiv.org/abs/2306.10007
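A hedged sketch of the masked pre-training recipe suggested by the entry above: interleaved sensorimotor tokens are randomly masked and a transformer regresses the original content at the masked positions. Pre-tokenized inputs, the mask ratio, and the plain MSE objective are assumptions for illustration.

```python
import torch
import torch.nn as nn

B, T, d = 2, 12, 64
tokens = torch.randn(B, T, d)                 # image/state/action tokens, pre-embedded
mask = torch.rand(B, T) < 0.5                 # positions to hide (50% for illustration)
mask_token = torch.zeros(1, 1, d)             # learned in practice; fixed here

inp = torch.where(mask.unsqueeze(-1), mask_token.expand(B, T, d), tokens)
layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
pred = encoder(inp)

loss = (pred - tokens)[mask].pow(2).mean()    # reconstruct only the masked positions
```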
Transfer learning involves adapting a pre-trained model to novel downstream tasks. However, we observe that current transfer learning methods often fail to focus on task-relevant features. In this work, we explore refocusing model attention for transfer learning…
External link:
http://arxiv.org/abs/2305.15542
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects… (see the sketch below)
External link:
http://arxiv.org/abs/2303.13043
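One generic way to realize the task-guided attention contrasted above is to let a task embedding gate the visual tokens, so saliency depends on the goal rather than on the stimulus alone. The gating form below is an illustrative assumption, not the paper's mechanism:

```python
import torch
import torch.nn as nn

class TopDownGate(nn.Module):
    """Suppress tokens that are irrelevant to a given task embedding."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, tokens, task):             # tokens: (B, N, d), task: (B, d)
        sim = (tokens * self.proj(task).unsqueeze(1)).sum(-1, keepdim=True)
        gate = torch.sigmoid(sim)                 # per-token relevance in (0, 1)
        return tokens * gate

tokens, task = torch.randn(2, 49, 64), torch.randn(2, 64)
print(TopDownGate(64)(tokens, task).shape)        # torch.Size([2, 49, 64])
```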
Visual attention helps achieve robust perception under noise, corruption, and distribution shifts in human vision, which are areas where modern neural networks still fall short. We present VARS, Visual Attention from Recurrent Sparse reconstruction… (see the sketch below)
External link:
http://arxiv.org/abs/2204.10962
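In the spirit of the VARS entry above, attention can be read off a recurrent sparse-reconstruction loop: run a few ISTA-style steps to reconstruct the visual tokens from a dictionary, then score tokens by reconstruction quality. The dictionary, step sizes, and this particular readout are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def ista(x, D, steps=20, lam=0.1, lr=0.05):
    """Nonnegative sparse codes z minimizing ||x - z @ D||^2 + lam * ||z||_1."""
    z = torch.zeros(x.size(0), D.size(0))
    for _ in range(steps):
        grad = (z @ D - x) @ D.t()               # gradient of the reconstruction term
        z = torch.relu(z - lr * grad - lr * lam) # gradient step + soft threshold
    return z

tokens = torch.randn(49, 64)                     # visual tokens from one image
D = torch.randn(128, 64) / 8.0                   # random dictionary (assumption)
recon = ista(tokens, D) @ D
scores = -((tokens - recon) ** 2).sum(-1)        # per-token reconstruction quality
attn = torch.softmax(scores, dim=0)              # one plausible attention readout
print(attn.shape)                                # torch.Size([49])
```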