Showing 1 - 10 of 167 for search: '"Yu, Wenyi"'
Author:
Wang, Siyin, Yu, Wenyi, Yang, Yudong, Tang, Changli, Li, Yixuan, Zhuang, Jimin, Chen, Xianzhao, Tian, Xiaohai, Zhang, Jun, Sun, Guangzhi, Lu, Lu, Zhang, Chao
Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to cover using one small model designed for a single task. In this paper
External link:
http://arxiv.org/abs/2409.16644
Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may
External link:
http://arxiv.org/abs/2409.09642
Author:
Lou, Xingyu, Yang, Yu, Dong, Kuiyao, Huang, Heyuan, Yu, Wenyi, Wang, Ping, Li, Xiu, Wang, Jun
As the recommendation service needs to address increasingly diverse distributions, such as multi-population, multi-scenario, multi-target, and multi-interest, more and more recent works have focused on multi-distribution modeling and achieved great pr
External link:
http://arxiv.org/abs/2408.01332
Author:
Sun, Guangzhi, Yu, Wenyi, Tang, Changli, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Wang, Yuxuan, Zhang, Chao
Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, wh
External link:
http://arxiv.org/abs/2406.15704
Author:
Tang, Changli, Yu, Wenyi, Sun, Guangzhi, Chen, Xianzhao, Tan, Tian, Li, Wei, Zhang, Jun, Lu, Lu, Ma, Zejun, Wang, Yuxuan, Zhang, Chao
This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance u
External link:
http://arxiv.org/abs/2406.07914
Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approxim
External link:
http://arxiv.org/abs/2406.06420
Author:
Chen, Zhe, Liu, Heyang, Yu, Wenyi, Sun, Guangzhi, Liu, Hongcheng, Wu, Ji, Zhang, Chao, Wang, Yu, Wang, Yanfeng
Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts an
External link:
http://arxiv.org/abs/2403.14168
Author:
Tang, Changli, Yu, Wenyi, Sun, Guangzhi, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao
Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events
External link:
http://arxiv.org/abs/2310.13289
Author:
Sun, Guangzhi, Yu, Wenyi, Tang, Changli, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao
Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a
External link:
http://arxiv.org/abs/2310.05863
Author:
Yu, Wenyi, Tang, Changli, Sun, Guangzhi, Chen, Xianzhao, Tan, Tian, Li, Wei, Lu, Lu, Ma, Zejun, Zhang, Chao
The impressive capability and versatility of large language models (LLMs) have aroused increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encode
External link:
http://arxiv.org/abs/2309.13963