Showing 1 - 10 of 38 for search: '"Rawat, Yogesh S"'
Geometric understanding is crucial for navigating and interacting with our environment. While large Vision Language Models (VLMs) demonstrate impressive capabilities, deploying them in real-world scenarios necessitates a comparable geometric understa…
External link:
http://arxiv.org/abs/2408.11748
Illustration is a fundamental mode of human expression and communication. Certain types of motion that accompany speech can provide this illustrative mode of communication. While Augmented and Virtual Reality technologies (AR/VR) have introduced tool…
External link:
http://arxiv.org/abs/2407.08906
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs achieve this by capturing robust and generic features from video data. Thi…
External link:
http://arxiv.org/abs/2405.03770
In this work we present a novel task of understanding unintentional human activities in videos. We formalize this problem as a reasoning task under a zero-shot scenario, where given a video of an unintentional activity we want to know why it transition…
External link:
http://arxiv.org/abs/2402.19405
Recent advancements in large-scale pre-training of visual-language models on paired image-text data have demonstrated impressive generalization capabilities for zero-shot tasks. Building on this success, efforts have been made to adapt these image-ba…
External link:
http://arxiv.org/abs/2312.08010
Author:
Schiappa, Madeline Chantry, Azad, Shehreen, VS, Sachidanand, Ge, Yunhao, Miksik, Ondrej, Rawat, Yogesh S., Vineet, Vibhav
Due to the increase in computational resources and accessibility of data, large deep learning models trained on copious amounts of multi-modal data using self-supervised or semi-supervised learning have emerged. These ``foundation'' m…
External link:
http://arxiv.org/abs/2306.09278
In this work, we focus on generating graphical representations of noisy, instructional videos for video understanding. We propose a self-supervised, interpretable approach that does not require any annotations for graphical representations, which wou…
External link:
http://arxiv.org/abs/2207.08001
Published in:
Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (2022)
Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single-modal learning. However, the robustness of these approaches against real-world perturbations has not been studied. In…
External link:
http://arxiv.org/abs/2207.02159
Published in:
ACM Comput. Surv. (December 2022)
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the…
External link:
http://arxiv.org/abs/2207.00419
We present LARNet, a novel end-to-end approach for generating human action videos. Joint generative modeling of appearance and dynamics to synthesize a video is very challenging, and therefore recent works in video synthesis have proposed to decompo…
External link:
http://arxiv.org/abs/2110.10899