Search results - "Zhu, Zhihong"

Report

DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval

Authors: Xin, Yifei, Cheng, Xuxin, Zhu, Zhihong, Yang, Xusheng, Zou, Yuexian

Existing audio-text retrieval (ATR) methods are essentially discriminative models that aim to maximize the conditional likelihood, represented as p(candidates|query). Nevertheless, this methodology fails to consider the intrinsic data distribution p(

External link: http://arxiv.org/abs/2409.10025

Report

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation

Authors: Xin, Yifei, Zhu, Zhihong, Cheng, Xuxin, Yang, Xusheng, Zou, Yuexian

Most existing audio-text retrieval (ATR) approaches typically rely on a single-level interaction to associate audio and text, limiting their ability to align different modalities and leading to suboptimal matches. In this work, we present a novel ATR

External link: http://arxiv.org/abs/2409.09256

Report

Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective

Authors: Hu, Guimin, Xin, Yi, Lyu, Weimin, Huang, Haojian, Sun, Chang, Zhu, Zhihong, Gui, Lin, Cai, Ruichu

Multimodal affective computing (MAC) has garnered increasing attention due to its broad applications in analyzing human behaviors and intentions, especially in the text-dominated multimodal affective computing field. This survey presents the recent trend

External link: http://arxiv.org/abs/2409.07388

Report

XMeCap: Meme Caption Generation with Sub-Image Adaptability

Authors: Chen, Yuyan, Yan, Songzhou, Zhu, Zhihong, Li, Zhixu, Xiao, Yanghua

Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated distinctively b

External link: http://arxiv.org/abs/2407.17152

Report

MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?

Authors: Chen, Zhaorun, Du, Yichao, Wen, Zichen, Zhou, Yiyang, Cui, Chenhang, Weng, Zhenzhen, Tu, Haoqin, Wang, Chaoqi, Tong, Zhengwei, Huang, Qinglan, Chen, Canyu, Ye, Qinghao, Zhu, Zhihong, Zhang, Yuqing, Zhou, Jiawei, Zhao, Zhuokai, Rafailov, Rafael, Finn, Chelsea, Yao, Huaxiu

While text-to-image models like DALLE-3 and Stable Diffusion are rapidly proliferating, they often encounter challenges such as hallucination, bias, and the production of unsafe, low-quality output. To effectively address these issues, it is crucial

External link: http://arxiv.org/abs/2407.04842

Report

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Authors: Wan, Zhongwei, Wu, Ziang, Liu, Che, Huang, Jinfa, Zhu, Zhihong, Jin, Peng, Wang, Longyue, Yuan, Li

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unl

External link: http://arxiv.org/abs/2406.18139

Report

D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models

Authors: Wan, Zhongwei, Wu, Xinjian, Zhang, Yu, Xin, Yi, Tao, Chaofan, Zhu, Zhihong, Wang, Xin, Luo, Siqi, Xiong, Jing, Zhang, Mi

Efficient inference in Large Language Models (LLMs) is impeded by the growing memory demands of key-value (KV) caching, especially for longer sequences. Traditional KV cache eviction strategies, which prioritize less critical KV-pairs based on attent

External link: http://arxiv.org/abs/2406.13035

Report

Instantaneous optical singularities and duality-protected dark directions

Authors: Wen, Chunchao, Zhang, Jianfa, Zhang, Chaofan, Qin, Shiqiao, Zhu, Zhihong, Liu, Wei

Electromagnetic waves are described not only by polarization ellipses but also by the cyclically rotating vectors that trace them out. The corresponding fields are, respectively, directionless steady line fields and directional instantaneous vector fields. Here

External link: http://arxiv.org/abs/2406.06132

Report

Towards Spoken Language Understanding via Multi-level Multi-grained Contrastive Learning

Authors: Cheng, Xuxin, Xu, Wanshi, Zhu, Zhihong, Li, Hongxiang, Zou, Yuexian

Spoken language understanding (SLU) is a core task in task-oriented dialogue systems, which aims at understanding the user's current goal through constructing semantic frames. SLU usually consists of two subtasks, including intent detection and slot

External link: http://arxiv.org/abs/2405.20852

Report

Textual Inversion and Self-supervised Refinement for Radiology Report Generation

Authors: Luo, Yuanjiang, Li, Hongxiang, Wu, Xuan, Cao, Meng, Huang, Xiaoshuang, Zhu, Zhihong, Liao, Peixi, Chen, Hu, Zhang, Yi

Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring re

External link: http://arxiv.org/abs/2405.20607