Showing 1 - 10 of 1,176 for search: '"Zhu, Zhihong"'
Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(
External link:
http://arxiv.org/abs/2409.10025
Most existing audio-text retrieval (ATR) approaches typically rely on a single-level interaction to associate audio and text, limiting their ability to align different modalities and leading to suboptimal matches. In this work, we present a novel ATR
External link:
http://arxiv.org/abs/2409.09256
Authors:
Hu, Guimin, Xin, Yi, Lyu, Weimin, Huang, Haojian, Sun, Chang, Zhu, Zhihong, Gui, Lin, Cai, Ruichu
Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents the recent trend
External link:
http://arxiv.org/abs/2409.07388
Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively b
External link:
http://arxiv.org/abs/2407.17152
Authors:
Chen, Zhaorun, Du, Yichao, Wen, Zichen, Zhou, Yiyang, Cui, Chenhang, Weng, Zhenzhen, Tu, Haoqin, Wang, Chaoqi, Tong, Zhengwei, Huang, Qinglan, Chen, Canyu, Ye, Qinghao, Zhu, Zhihong, Zhang, Yuqing, Zhou, Jiawei, Zhao, Zhuokai, Rafailov, Rafael, Finn, Chelsea, Yao, Huaxiu
While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial
External link:
http://arxiv.org/abs/2407.04842
Authors:
Wan, Zhongwei, Wu, Ziang, Liu, Che, Huang, Jinfa, Zhu, Zhihong, Jin, Peng, Wang, Longyue, Yuan, Li
Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unl
External link:
http://arxiv.org/abs/2406.18139
Authors:
Wan, Zhongwei, Wu, Xinjian, Zhang, Yu, Xin, Yi, Tao, Chaofan, Zhu, Zhihong, Wang, Xin, Luo, Siqi, Xiong, Jing, Zhang, Mi
Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which prioritize less critical KV-pairs based on attent
External link:
http://arxiv.org/abs/2406.13035
Electromagnetic waves are described not only by polarization ellipses but also by the cyclically rotating vectors that trace them out. The corresponding fields are, respectively, directionless steady line fields and directional instantaneous vector fields. Here
External link:
http://arxiv.org/abs/2406.06132
Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user's current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot
External link:
http://arxiv.org/abs/2405.20852
Authors:
Luo, Yuanjiang, Li, Hongxiang, Wu, Xuan, Cao, Meng, Huang, Xiaoshuang, Zhu, Zhihong, Liao, Peixi, Chen, Hu, Zhang, Yi
Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring re
External link:
http://arxiv.org/abs/2405.20607