Showing 1 - 10 of 281 for search: '"Laptev, Ivan A."'
Recent Large Vision Language Models (LVLMs) present remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon where LVLMs are prone to generate textual responses…
External link:
http://arxiv.org/abs/2410.15926
Contact planning for legged robots in extremely constrained environments is challenging. The main difficulty stems from the mixed nature of the problem: discrete search combined with continuous trajectory optimization. To speed up the discrete search…
External link:
http://arxiv.org/abs/2407.11788
Recent advances in vision-language models have significantly propelled video understanding. Existing datasets and tasks, however, have notable limitations. Most datasets are confined to short videos with limited events and narrow narratives. For example…
External link:
http://arxiv.org/abs/2406.10221
Author:
Fares, Samar, Ziu, Klea, Aremu, Toluwani, Durasov, Nikita, Takáč, Martin, Fua, Pascal, Nandakumar, Karthik, Laptev, Ivan
Vision-Language Models (VLMs) are becoming increasingly vulnerable to adversarial attacks as various novel attack strategies are being proposed against these models. While existing defenses excel in unimodal contexts, they currently fall short in safeguarding…
External link:
http://arxiv.org/abs/2406.09250
In this work, we aim to learn a unified vision-based policy for multi-fingered robot hands to manipulate a variety of objects in diverse poses. Though prior work has shown the benefits of using human videos for policy learning, performance gains have been…
External link:
http://arxiv.org/abs/2404.15709
Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet, prevailing approaches focus on pre-training 2D representations, which are sub-optimal for dealing with occlusions and accurately localizing objects…
External link:
http://arxiv.org/abs/2404.01491
We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment…
External link:
http://arxiv.org/abs/2312.07322
The ability of robots to comprehend and execute manipulation tasks based on natural language instructions is a long-term goal in robotics. The dominant approaches for language-guided manipulation use 2D image representations, which face difficulties…
External link:
http://arxiv.org/abs/2309.15596
Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present VidChapters-7M…
External link:
http://arxiv.org/abs/2309.13952
Object goal navigation aims to navigate an agent to locations of a given object category in unseen environments. Classical methods explicitly build maps of environments and require extensive engineering while lacking semantic information for object-oriented…
External link:
http://arxiv.org/abs/2308.05602