Search results

Report

Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

Authors: Wang, Siyin; Yu, Wenyi; Yang, Yudong; Tang, Changli; Li, Yixuan; Zhuang, Jimin; Chen, Xianzhao; Tian, Xiaohai; Zhang, Jun; Sun, Guangzhi; Lu, Lu; Zhang, Chao

Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to cover using one small model designed for a single task. In this paper

External link: http://arxiv.org/abs/2409.16644

Report

Extract and Diffuse: Latent Integration for Improved Diffusion-based Speech and Vocal Enhancement

Authors: Yang, Yudong; Liu, Zhan; Yu, Wenyi; Sun, Guangzhi; Kong, Qiuqiang; Zhang, Chao

Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may

External link: http://arxiv.org/abs/2409.09642

Report

HMDN: Hierarchical Multi-Distribution Network for Click-Through Rate Prediction

Authors: Lou, Xingyu; Yang, Yu; Dong, Kuiyao; Huang, Heyuan; Yu, Wenyi; Wang, Ping; Li, Xiu; Wang, Jun

As the recommendation service needs to address increasingly diverse distributions, such as multi-population, multi-scenario, multi-target, and multi-interest, more and more recent works have focused on multi-distribution modeling and achieved great pr

External link: http://arxiv.org/abs/2408.01332

Report

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Authors: Sun, Guangzhi; Yu, Wenyi; Tang, Changli; Chen, Xianzhao; Tan, Tian; Li, Wei; Lu, Lu; Ma, Zejun; Wang, Yuxuan; Zhang, Chao

Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, wh

External link: http://arxiv.org/abs/2406.15704

Report

Can Large Language Models Understand Spatial Audio?

Authors: Tang, Changli; Yu, Wenyi; Sun, Guangzhi; Chen, Xianzhao; Tan, Tian; Li, Wei; Zhang, Jun; Lu, Lu; Ma, Zejun; Wang, Yuxuan; Zhang, Chao

This paper explores enabling large language models (LLMs) to understand spatial information from multichannel audio, a skill currently lacking in auditory LLMs. By leveraging LLMs' advanced cognitive and inferential abilities, the aim is to enhance u

External link: http://arxiv.org/abs/2406.07914

Report

An Improved Empirical Fisher Approximation for Natural Gradient Descent

Authors: Wu, Xiaodong; Yu, Wenyi; Zhang, Chao; Woodland, Philip

Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approxim

External link: http://arxiv.org/abs/2406.06420

Report

M$^3$AV: A Multimodal, Multigenre, and Multipurpose Audio-Visual Academic Lecture Dataset

Authors: Chen, Zhe; Liu, Heyang; Yu, Wenyi; Sun, Guangzhi; Liu, Hongcheng; Wu, Ji; Zhang, Chao; Wang, Yu; Wang, Yanfeng

Publishing open-source academic video recordings is an emergent and prevalent approach to sharing knowledge online. Such videos carry rich multimodal information including speech, the facial and body movements of the speakers, as well as the texts an

External link: http://arxiv.org/abs/2403.14168

Report

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Authors: Tang, Changli; Yu, Wenyi; Sun, Guangzhi; Chen, Xianzhao; Tan, Tian; Li, Wei; Lu, Lu; Ma, Zejun; Zhang, Chao

Hearing is arguably an essential ability of artificial intelligence (AI) agents in the physical world, which refers to the perception and understanding of general auditory information consisting of at least three types of sounds: speech, audio events

External link: http://arxiv.org/abs/2310.13289

Report

Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

Authors: Sun, Guangzhi; Yu, Wenyi; Tang, Changli; Chen, Xianzhao; Tan, Tian; Li, Wei; Lu, Lu; Ma, Zejun; Zhang, Chao

Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a

External link: http://arxiv.org/abs/2310.05863

Report

Connecting Speech Encoder and Large Language Model for ASR

Authors: Yu, Wenyi; Tang, Changli; Sun, Guangzhi; Chen, Xianzhao; Tan, Tian; Li, Wei; Lu, Lu; Ma, Zejun; Zhang, Chao

The impressive capability and versatility of large language models (LLMs) have aroused increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encode

External link: http://arxiv.org/abs/2309.13963
