Showing 1 - 10 of 36 for search: '"Woo, SangMin"'
Author:
Nugroho, Muhammad Adi, Woo, Sangmin, Lee, Sumin, Park, Jinyoung, Wang, Yooseung, Kim, Donguk, Kim, Changick
Weakly-Supervised Group Activity Recognition (WSGAR) aims to understand the activity performed together by a group of individuals with the video-level label and without actor-level labels. We propose Flow-Assisted Motion Learning Network (Flaming-Net
External link:
http://arxiv.org/abs/2405.18012
Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models
This study addresses the issue observed in Large Vision Language Models (LVLMs), where excessive attention on a few image tokens, referred to as blind tokens, leads to hallucinatory responses in tasks requiring fine-grained understanding of visual ob
External link:
http://arxiv.org/abs/2405.17820
We present Diffusion Model Patching (DMP), a simple method to boost the performance of pre-trained diffusion models that have already reached convergence, with a negligible increase in parameters. DMP inserts a small, learnable set of prompts into th
External link:
http://arxiv.org/abs/2405.17825
Recent advancements in Large Vision Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs. Despite their impressive capabilities, they often produce "hallucinatory" outputs that do n
External link:
http://arxiv.org/abs/2405.17821
Panoramic Activity Recognition (PAR) seeks to identify diverse human activities across different scales, from individual actions to social group and global activities in crowded panoramic scenes. PAR presents two major challenges: 1) recognizing the
External link:
http://arxiv.org/abs/2403.14113
Diffusion models have achieved remarkable success across a range of generative tasks. Recent efforts to enhance diffusion model architectures have reimagined them as a form of multi-task learning, where each task corresponds to a denoising task at a
External link:
http://arxiv.org/abs/2403.09176
Recent progress in single-image 3D generation highlights the importance of multi-view coherency, leveraging 3D priors from large-scale diffusion models pretrained on Internet-scale images. However, the aspect of novel-view diversity remains underexpl
External link:
http://arxiv.org/abs/2312.15980
Due to the distinctive characteristics of sensors, each modality exhibits unique physical properties. For this reason, in the context of multi-modal action recognition, it is important to consider not only the overall action content but also the comp
External link:
http://arxiv.org/abs/2311.12344
Diffusion models generate highly realistic images by learning a multi-step denoising process, naturally embodying the principles of multi-task learning (MTL). Despite the inherent connection between diffusion models and MTL, there remains an unexplor
External link:
http://arxiv.org/abs/2310.07138
Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we
External link:
http://arxiv.org/abs/2308.09322