Showing 1 - 10 of 1,342 results for search: '"Grauman, A."'
While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level.
External link:
http://arxiv.org/abs/2412.02071
Anticipating how a person will interact with objects in an environment is essential for activity understanding, but existing methods are limited to the 2D space of video frames, capturing physically ungrounded predictions of 'what' and ignoring the 'w
External link:
http://arxiv.org/abs/2412.00932
Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approa
External link:
http://arxiv.org/abs/2411.08753
Authors:
Lai, Bolin, Toyer, Sam, Nagarajan, Tushar, Girdhar, Rohit, Zha, Shengxin, Rehg, James M., Kitani, Kris, Grauman, Kristen, Desai, Ruta, Liu, Miao
Predicting future human behavior is an increasingly popular topic in computer vision, driven by the interest in applications such as autonomous vehicles, digital assistants and human-robot interactions. The literature on behavior prediction spans var
External link:
http://arxiv.org/abs/2410.14045
Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the
External link:
http://arxiv.org/abs/2408.00672
Authors:
Chen, Changan, Peng, Puyuan, Baid, Ami, Xue, Zihui, Hsu, Wei-Ning, Harwath, David, Grauman, Kristen
Generating realistic audio for human actions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training
External link:
http://arxiv.org/abs/2406.09272
We study the problem of precisely swapping objects in videos, with a focus on those interacted with by hands, given one user-provided reference object image. Despite the great advancements that diffusion models have made in video editing recently, th
External link:
http://arxiv.org/abs/2406.07754
Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy
External link:
http://arxiv.org/abs/2405.02821
An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consum
External link:
http://arxiv.org/abs/2404.16216
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC
External link:
http://arxiv.org/abs/2404.05206