Showing 1 - 10 of 41 for the search: '"HUANG Jiabo"'
Visual-textual correlations in the attention maps derived from text-to-image diffusion models have proven beneficial to dense visual prediction tasks, e.g., semantic segmentation. However, a significant challenge arises due to the input distributional…
External link:
http://arxiv.org/abs/2410.11473
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a…
External link:
http://arxiv.org/abs/2406.17880
Video moment retrieval (VMR) searches for a visual temporal moment in an untrimmed raw video given a text query description (sentence). Existing studies either start from collecting exhaustive frame-wise annotations on the temporal boundary of…
External link:
http://arxiv.org/abs/2406.01791
The remarkable capability of large language models (LLMs) in generating high-quality code has drawn increasing attention in the software testing community. However, existing code LLMs often demonstrate unsatisfactory capabilities in generating accurate…
External link:
http://arxiv.org/abs/2402.03396
Video Moment Retrieval (VMR) requires precise modelling of fine-grained moment-text associations to capture intricate visual-language relationships. Due to the lack of a diverse and generalisable VMR dataset to facilitate learning scalable moment-text…
External link:
http://arxiv.org/abs/2401.13329
Large language models (LLMs) for natural language processing have been grafted onto programming language modeling for advancing code intelligence. Although it can be represented in text format, code is syntactically more rigorous in order to be…
External link:
http://arxiv.org/abs/2309.09980
Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely biased when derived from a limited amount of moment-text data…
External link:
http://arxiv.org/abs/2309.00661
The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods heavily rely on separate pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary…
External link:
http://arxiv.org/abs/2303.00040
Current methods for video activity localisation over time assume implicitly that the activity temporal boundaries labelled for model training are determined and precise. However, in unscripted natural videos, different activities mostly transit smoothly…
External link:
http://arxiv.org/abs/2206.12923
Person Re-identification (ReID) has advanced remarkably over the last 10 years alongside the rapid development of deep learning for visual recognition. However, the i.i.d. (independent and identically distributed) assumption commonly held in…
External link:
http://arxiv.org/abs/2205.11197