Showing 1 - 10
of 10
for search: '"Beedu, Apoorva"'
Foundational models are able to generate text outputs given prompt instructions and text, audio, or image inputs. Recently, these models have been combined to perform tasks on video, such as video summarization. Such video foundation models perform pr…
External link:
http://arxiv.org/abs/2410.07405
Video Language Models (VLMs) are crucial for generalizing across diverse tasks and using language cues to enhance learning. While transformer-based architectures have been the de facto standard in vision-language training, they face challenges like quadratic…
External link:
http://arxiv.org/abs/2409.11513
Author:
Haresamudram, Harish, Beedu, Apoorva, Rabbi, Mashfiqui, Saha, Sankalita, Essa, Irfan, Ploetz, Thomas
Cross-modal contrastive pre-training between natural language and other modalities, e.g., vision and audio, has demonstrated astonishing performance and effectiveness across a diverse variety of tasks and domains. In this paper, we investigate whethe…
External link:
http://arxiv.org/abs/2408.12023
Anticipating future actions is a highly challenging task due to the diversity and scale of potential future actions; yet, information from different modalities helps narrow down plausible action choices. Each modality can provide diverse and often com…
External link:
http://arxiv.org/abs/2401.12972
Human Activity Recognition (HAR) systems have been extensively studied by the vision and ubiquitous computing communities due to their practical applications in daily life, such as smart homes, surveillance, and health monitoring. Typically, this pro…
External link:
http://arxiv.org/abs/2309.01262
To properly assist humans in their needs, human activity recognition (HAR) systems need the ability to fuse information from multiple modalities. Our hypothesis is that multimodal sensors, visual and non-visual, tend to provide complementary informati…
External link:
http://arxiv.org/abs/2211.04331
The video-based dialog task is a challenging multimodal learning task that has received increasing attention over the past few years, with the state of the art setting new performance records. This progress is largely powered by the adaptation of the more p…
External link:
http://arxiv.org/abs/2210.14512
We introduce a Transformer-based 6D object pose estimation framework, VideoPose, comprising an end-to-end attention-based modeling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach lev…
External link:
http://arxiv.org/abs/2210.13540
We introduce a simple yet effective algorithm that uses convolutional neural networks to directly estimate object poses from videos. Our approach leverages the temporal information from a video sequence, and is computationally efficient and robust to…
External link:
http://arxiv.org/abs/2111.10677
Published in:
2015 IEEE International Conference on Electronics, Computing & Communication Technologies (CONECCT); 2015, pp. 1-6