Showing 1 - 10 of 1,414 results for search: '"Wang, Shuhui"'
Published in:
IEEE Transactions on Circuits and Systems for Video Technology, 2024
Video Question Answering (VideoQA) represents a crucial intersection between video understanding and language processing, requiring both discriminative unimodal comprehension and sophisticated cross-modal interaction for accurate inference. Despite a…
External link:
http://arxiv.org/abs/2410.09380
Designing effective graph neural networks (GNNs) with message passing has two fundamental challenges, i.e., determining optimal message-passing pathways and designing local aggregators. Previous methods of designing optimal pathways are limited with…
External link:
http://arxiv.org/abs/2407.18480
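The message passing that the abstract above refers to can be sketched as a single aggregation step. This is a minimal illustrative toy (mean aggregation over a fixed adjacency matrix in NumPy); the function name, the 0.5 combine rule, and the example graph are assumptions for illustration, not the pathway-design method the paper proposes.

```python
import numpy as np

def message_passing_step(adj, features):
    """One toy message-passing step: mean-aggregate neighbor features,
    then combine with the node's own features.

    adj:      (n, n) 0/1 adjacency matrix
    features: (n, d) node feature matrix
    """
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1  # avoid division by zero for isolated nodes
    neighbor_mean = (adj @ features) / deg
    return 0.5 * (features + neighbor_mean)  # simple fixed combine rule

# Tiny 3-node path graph: 0 - 1 - 2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
x = np.array([[1.0], [0.0], [1.0]])
out = message_passing_step(adj, x)
```

In a real GNN layer the combine rule would be learned (e.g. a weight matrix plus nonlinearity) and the aggregator itself is one of the design choices the paper discusses.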
Author:
Wang, Shuhui, Sun, Zihan, Hu, Chaochen, Li, Chao, Zhang, Yong, Yao, Yandong, Wang, Hao, Xing, Chunxiao
Recent years have seen massive time-series data generated in many areas. This new scenario brings challenges, particularly in data ingestion, where existing technologies struggle to handle such massive time-series data, leading to…
External link:
http://arxiv.org/abs/2406.05462
Video activity anticipation aims to predict what will happen in the future, with broad application prospects ranging from robot vision to autonomous driving. Despite recent progress, the data uncertainty issue, reflected as the content evo…
External link:
http://arxiv.org/abs/2404.18648
Confusing Pair Correction Based on Category Prototype for Domain Adaptation under Noisy Environments
In this paper, we address unsupervised domain adaptation under noisy environments, which is more challenging and practical than traditional domain adaptation. In this scenario, the model is prone to overfitting noisy labels, resulting in a more prono…
External link:
http://arxiv.org/abs/2403.12883
Three-Dimensional (3D) dense captioning is an emerging vision-language bridging task that aims to generate multiple detailed and accurate descriptions for 3D scenes. It presents significant potential and challenges due to its closer representation of…
External link:
http://arxiv.org/abs/2403.07469
Temporal Sentence Grounding in Video (TSGV) is troubled by a dataset bias issue, caused by the uneven temporal distribution of target moments for samples with similar semantic components in input videos or query texts. Existing methods res…
External link:
http://arxiv.org/abs/2401.07567
Recent text-to-image (T2I) diffusion models have achieved remarkable progress in generating high-quality images given text prompts as input. However, these models fail to convey appropriate spatial composition specified by a layout instruction. In th…
External link:
http://arxiv.org/abs/2310.08872
Given an image and an associated textual question, the purpose of Knowledge-Based Visual Question Answering (KB-VQA) is to provide a correct answer to the question with the aid of external knowledge bases. Prior KB-VQA models are usually formulated a…
External link:
http://arxiv.org/abs/2310.08148
Author:
Wang, Yabing, Wang, Shuhui, Luo, Hao, Dong, Jianfeng, Wang, Fan, Han, Meng, Wang, Xun, Wang, Meng
Current research on cross-modal retrieval is mostly English-oriented, owing to the availability of large English-oriented human-labeled vision-language corpora. To break the limit of non-English labeled data, cross-lingual cross-modal…
External link:
http://arxiv.org/abs/2309.05451