Showing 1 - 10 of 158
for search: '"Wong, YongKang"'
This technical report introduces our top-ranked solution that employs two approaches, i.e., suffix injection and projected gradient descent (PGD), to address the TiFA workshop MLLM attack challenge. Specifically, we first append the text from an incor…
External link:
http://arxiv.org/abs/2412.15614
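The abstract above mentions projected gradient descent (PGD) as one of the two attack components. As a minimal illustration of the general PGD idea, here is a hypothetical toy sketch in NumPy: iteratively step in the direction of the loss gradient's sign, then project the perturbed input back onto an L-infinity ball around the original. The function names, step sizes, and toy loss are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=0.1, alpha=0.05, steps=20):
    """Toy L-infinity PGD sketch (hypothetical, not the paper's code).

    x       : clean input (NumPy array)
    grad_fn : callable returning the gradient of the loss w.r.t. the input
    eps     : radius of the L-inf ball around x
    alpha   : per-step size
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)        # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back onto the ball
    return x_adv

# Toy example: maximize the loss 0.5 * ||x||^2, whose gradient is x itself.
x0 = np.array([0.5, -0.5])
adv = pgd_attack(x0, grad_fn=lambda v: v)
```

In this toy run the adversarial point saturates at the corner of the eps-ball (here `[0.6, -0.6]`), which increases the quadratic loss while staying within the allowed perturbation budget.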
The creation of 4D avatars (i.e., animated 3D avatars) from text description typically uses text-to-image (T2I) diffusion models to synthesize 3D avatars in the canonical space and subsequently applies animation with target motions. However, such an…
External link:
http://arxiv.org/abs/2406.04629
Recent advancements in image understanding have benefited from the extensive use of web image-text pairs. However, video understanding remains a challenge despite the availability of substantial web video-text data. This difficulty primarily arises f…
External link:
http://arxiv.org/abs/2405.13911
Author:
Cheng, Yi, Xu, Ziwei, Lin, Dongyun, Cheng, Harry, Wong, Yongkang, Sun, Ying, Lim, Joo Hwee, Kankanhalli, Mohan
For visual content generation, discrepancies between user intentions and the generated content have been a longstanding problem. This discrepancy arises from two main factors. First, user intentions are inherently complex, with subtle details not ful…
External link:
http://arxiv.org/abs/2405.12538
The rapid adoption of text-to-image diffusion models in society underscores an urgent need to address their biases. Without interventions, these biases could propagate a skewed worldview and restrict opportunities for minority groups. In this work, w…
External link:
http://arxiv.org/abs/2311.07604
Learning a versatile language-image model is computationally prohibitive under a limited computing budget. This paper delves into efficient language-image pre-training, an area that has received relatively little attention despite its impo…
External link:
http://arxiv.org/abs/2309.16738
The objective of the multi-condition human motion synthesis task is to incorporate diverse conditional inputs, encompassing various forms like text, music, speech, and more. This endows the task with the capability to adapt across multiple scenarios,…
External link:
http://arxiv.org/abs/2309.03031
Author:
Cheng, Yi, Xu, Ziwei, Fang, Fen, Lin, Dongyun, Fan, Hehe, Wong, Yongkang, Sun, Ying, Kankanhalli, Mohan
In this technical report, we present our findings from a study conducted on the EPIC-KITCHENS-100 Unsupervised Domain Adaptation task for Action Recognition. Our research focuses on the innovative application of a differentiable logic loss in the tra…
External link:
http://arxiv.org/abs/2307.06569
Detecting Human-Object Interaction (HOI) in images is an important step towards high-level visual comprehension. Existing work often sheds light on improving either human and object detection, or interaction recognition. However, due to the limitation…
External link:
http://arxiv.org/abs/2207.02400
Human-Object Interaction (HOI) detection has received considerable attention in the context of scene understanding. Despite the growing progress on benchmarks, we realize that existing methods often perform unsatisfactorily on distant interactions, w…
External link:
http://arxiv.org/abs/2207.01869