Search results

Report

Authors: Xue, Zihui; An, Joungbin; Yang, Xitong; Grauman, Kristen

While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level.

External link: http://arxiv.org/abs/2412.02071

Report

FIction: 4D Future Interaction Prediction from Video

Authors: Ashutosh, Kumar; Pavlakos, Georgios; Grauman, Kristen

Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'w

External link: http://arxiv.org/abs/2412.00932

Report

Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos

Authors: Majumder, Sagnik; Nagarajan, Tushar; Al-Halah, Ziad; Pradhan, Reina; Grauman, Kristen

Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approa

External link: http://arxiv.org/abs/2411.08753

Report

Human Action Anticipation: A Survey

Authors: Lai, Bolin; Toyer, Sam; Nagarajan, Tushar; Girdhar, Rohit; Zha, Shengxin; Rehg, James M.; Kitani, Kris; Grauman, Kristen; Desai, Ruta; Liu, Miao

Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans var

External link: http://arxiv.org/abs/2410.14045

Report

ExpertAF: Expert Actionable Feedback from Video

Authors: Ashutosh, Kumar; Nagarajan, Tushar; Pavlakos, Georgios; Kitani, Kris; Grauman, Kristen

Feedback is essential for learning a new skill or improving one's current skill-level. However, current methods for skill-assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the

External link: http://arxiv.org/abs/2408.00672

Report

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

Authors: Chen, Changan; Peng, Puyuan; Baid, Ami; Xue, Zihui; Hsu, Wei-Ning; Harwath, David; Grauman, Kristen

Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training

External link: http://arxiv.org/abs/2406.09272

Report

HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness

Authors: Xue, Zihui; Luo, Mi; Chen, Changan; Grauman, Kristen

We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, th

External link: http://arxiv.org/abs/2406.07754

Report

Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

Authors: Chen, Changan; Ramos, Jordi; Tomar, Anshul; Grauman, Kristen

Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy

External link: http://arxiv.org/abs/2405.02821

Report

ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

Authors: Somayazulu, Arjun; Majumder, Sagnik; Chen, Changan; Grauman, Kristen

An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consum

External link: http://arxiv.org/abs/2404.16216

Report

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Authors: Chen, Changan; Ashutosh, Kumar; Girdhar, Rohit; Harwath, David; Grauman, Kristen

We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC

External link: http://arxiv.org/abs/2404.05206
