Showing 1 - 10 of 23 for search: '"Shi, Baifeng"'
Author:
Niu, Dantong, Sharma, Yuvan, Biamby, Giscard, Quenum, Jerome, Bai, Yutong, Shi, Baifeng, Darrell, Trevor, Herzig, Roei
In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robot…
External link:
http://arxiv.org/abs/2406.11815
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S²)… (see the sketch below)
External link:
http://arxiv.org/abs/2403.13043
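The S² idea above replaces a bigger backbone with a smaller one run at several image scales. Below is a minimal PyTorch sketch of that idea; it resizes the whole image per scale (the paper tiles large images into sub-images instead), and `backbone`, its (B, C, H, W) output layout, and the scale set are illustrative assumptions, not the paper's interface.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multiscale_features(backbone, image, scales=(1.0, 2.0)):
    """Run `backbone` at several image scales; fuse features channel-wise."""
    feats, base = [], None
    for s in scales:
        x = image if s == 1.0 else F.interpolate(
            image, scale_factor=s, mode="bilinear", align_corners=False)
        f = backbone(x)                                  # (B, C, h_s, w_s)
        if base is None:
            base = f.shape[-2:]                          # spatial size at base scale
        feats.append(F.interpolate(f, size=base, mode="bilinear",
                                   align_corners=False))
    return torch.cat(feats, dim=1)                       # (B, C * len(scales), h, w)

backbone = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
img = torch.randn(1, 3, 224, 224)
print(multiscale_features(backbone, img).shape)          # torch.Size([1, 32, 112, 112])
```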
Author:
Radosavovic, Ilija, Zhang, Bike, Shi, Baifeng, Rajasegaran, Jathushan, Kamat, Sarthak, Darrell, Trevor, Sreenath, Koushil, Malik, Jitendra
We cast real-world humanoid control as a next token prediction problem, akin to predicting the next word in language. Our model is a causal transformer trained via autoregressive prediction of sensorimotor trajectories. To account for the multi-modal… (see the sketch below)
External link:
http://arxiv.org/abs/2402.19469
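A hedged sketch of the next-token formulation in the entry above: a causal transformer regresses the next sensorimotor token from the past ones. Flattening each timestep into a single continuous token, the model class, and all dimensions are simplifying assumptions; the paper's handling of multiple modalities is not reproduced here.

```python
import torch
import torch.nn as nn

class NextTokenPolicy(nn.Module):
    """Causal transformer over continuous sensorimotor tokens (toy version)."""
    def __init__(self, token_dim, d_model=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.embed = nn.Linear(token_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, token_dim)

    def forward(self, tokens):                            # (B, T, token_dim)
        T = tokens.size(1)
        h = self.embed(tokens) + self.pos[:, :T]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        return self.head(self.blocks(h, mask=causal))     # position t predicts token t+1

model = NextTokenPolicy(token_dim=32)
traj = torch.randn(4, 50, 32)                             # batch of sensorimotor trajectories
pred = model(traj)
loss = (pred[:, :-1] - traj[:, 1:]).pow(2).mean()         # autoregressive regression loss
```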
Author:
Fu, Letian, Lian, Long, Wang, Renhao, Shi, Baifeng, Wang, Xudong, Yala, Adam, Darrell, Trevor, Efros, Alexei A., Goldberg, Ken
In this work, we re-examine inter-patch dependencies in the decoding mechanism of masked autoencoders (MAE). We decompose this decoding mechanism for masked patch reconstruction in MAE into self-attention and cross-attention. Our investigations suggest… (see the sketch below)
External link:
http://arxiv.org/abs/2401.14391
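The decomposition mentioned above contrasts a standard MAE decoder (self-attention over visible and mask tokens together) with pure cross-attention, where each mask query attends only to the encoded visible patches. A minimal sketch of the cross-attention path, with made-up token counts (196 patches, 75% masked):

```python
import torch
import torch.nn as nn

d = 64
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

visible = torch.randn(2, 49, d)     # encoder outputs for the visible patches
queries = torch.randn(2, 147, d)    # mask tokens (+ positional info) for masked patches

# Each masked patch is decoded from the visible tokens alone:
# no self-attention among the mask queries themselves.
recon, attn = cross_attn(query=queries, key=visible, value=visible)
print(recon.shape)                  # torch.Size([2, 147, 64])
```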
Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities, especially in… (see the sketch below)
External link:
http://arxiv.org/abs/2312.02249
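To make the generate-and-execute loop above concrete, here is a toy illustration: a code-generating model emits a short program against a small vision API, and executing it yields the answer. The `detect`/`count` primitives and the dict-as-image stub are hypothetical stand-ins, not any real VP system's interface.

```python
def detect(image, name):
    """Hypothetical detector: boxes for `name` in `image` (stubbed)."""
    return image.get(name, [])

def count(boxes):
    return len(boxes)

# A program a model might generate for "How many cats are in the image?"
generated_program = """
boxes = detect(image, "cat")
answer = count(boxes)
"""

image = {"cat": [(0, 0, 10, 10), (20, 20, 30, 30)]}   # stub image with two cats
scope = {"detect": detect, "count": count, "image": image}
exec(generated_program, scope)
print(scope["answer"])                                 # 2
```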
Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations…
External link:
http://arxiv.org/abs/2309.17444
Author:
Radosavovic, Ilija, Shi, Baifeng, Fu, Letian, Goldberg, Ken, Darrell, Trevor, Malik, Jitendra
We present a self-supervised sensorimotor pre-training approach for robotics. Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens. Given a sequence of camera images, proprioceptive robot states, and actions, we encode… (see the sketch below)
External link:
http://arxiv.org/abs/2306.10007
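A hedged sketch of the masked pre-training recipe suggested by the entry above: interleaved sensorimotor tokens are randomly masked and a transformer regresses the original content at the masked positions. Pre-tokenized inputs, the mask ratio, and the plain MSE objective are assumptions for illustration.

```python
import torch
import torch.nn as nn

B, T, d = 2, 12, 64
tokens = torch.randn(B, T, d)                 # image/state/action tokens, pre-embedded
mask = torch.rand(B, T) < 0.5                 # positions to hide (50% for illustration)
mask_token = torch.zeros(1, 1, d)             # learned in practice; fixed here

inp = torch.where(mask.unsqueeze(-1), mask_token.expand(B, T, d), tokens)
layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
pred = encoder(inp)

loss = (pred - tokens)[mask].pow(2).mean()    # reconstruct only the masked positions
```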
Transfer learning involves adapting a pre-trained model to novel downstream tasks. However, we observe that current transfer learning methods often fail to focus on task-relevant features. In this work, we explore refocusing model attention for transfer learning…
External link:
http://arxiv.org/abs/2305.15542
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects… (see the sketch below)
External link:
http://arxiv.org/abs/2303.13043
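One generic way to realize the task-guided attention contrasted above is to let a task embedding gate the visual tokens, so saliency depends on the goal rather than on the stimulus alone. The gating form below is an illustrative assumption, not the paper's mechanism:

```python
import torch
import torch.nn as nn

class TopDownGate(nn.Module):
    """Suppress tokens that are irrelevant to a given task embedding."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, tokens, task):             # tokens: (B, N, d), task: (B, d)
        sim = (tokens * self.proj(task).unsqueeze(1)).sum(-1, keepdim=True)
        gate = torch.sigmoid(sim)                 # per-token relevance in (0, 1)
        return tokens * gate

tokens, task = torch.randn(2, 49, 64), torch.randn(2, 64)
print(TopDownGate(64)(tokens, task).shape)        # torch.Size([2, 49, 64])
```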
Visual attention helps achieve robust perception under noise, corruption, and distribution shifts in human vision, which are areas where modern neural networks still fall short. We present VARS, Visual Attention from Recurrent Sparse reconstruction… (see the sketch below)
External link:
http://arxiv.org/abs/2204.10962
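In the spirit of the VARS entry above, attention can be read off a recurrent sparse-reconstruction loop: run a few ISTA-style steps to reconstruct the visual tokens from a dictionary, then score tokens by reconstruction quality. The dictionary, step sizes, and this particular readout are illustrative assumptions, not the paper's exact formulation:

```python
import torch

def ista(x, D, steps=20, lam=0.1, lr=0.05):
    """Nonnegative sparse codes z minimizing ||x - z @ D||^2 + lam * ||z||_1."""
    z = torch.zeros(x.size(0), D.size(0))
    for _ in range(steps):
        grad = (z @ D - x) @ D.t()               # gradient of the reconstruction term
        z = torch.relu(z - lr * grad - lr * lam) # gradient step + soft threshold
    return z

tokens = torch.randn(49, 64)                     # visual tokens from one image
D = torch.randn(128, 64) / 8.0                   # random dictionary (assumption)
recon = ista(tokens, D) @ D
scores = -((tokens - recon) ** 2).sum(-1)        # per-token reconstruction quality
attn = torch.softmax(scores, dim=0)              # one plausible attention readout
print(attn.shape)                                # torch.Size([49])
```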