Výsledky vyhledávání

Report

Autoregressive Speech Synthesis without Vector Quantization

Autor: Meng, Lingwei, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Han, Bing, Hu, Shujie, Liu, Yanqing, Li, Jinyu, Zhao, Sheng, Wu, Xixin, Meng, Helen, Wei, Furu

We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector qua

Externí odkaz: http://arxiv.org/abs/2407.08551

Zobrazit plný text záznamu

Report

Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation

Autor: Geng, Mengzhe, Xie, Xurong, Deng, Jiajun, Jin, Zengrui, Li, Guinan, Wang, Tianzi, Hu, Shujie, Li, Zhaoqing, Meng, Helen, Liu, Xunying

The application of data-intensive automatic speech recognition (ASR) technologies to dysarthric and elderly adult speech is confronted by their mismatch against healthy and nonaged voices, data scarcity and large speaker-level variability. To this en

Externí odkaz: http://arxiv.org/abs/2407.06310

Zobrazit plný text záznamu

Report

One-pass Multiple Conformer and Foundation Speech Systems Compression and Quantization Using An All-in-one Neural Model

Autor: Li, Zhaoqing, Xu, Haoning, Wang, Tianzi, Hu, Shoukang, Jin, Zengrui, Hu, Shujie, Deng, Jiajun, Cui, Mingyu, Geng, Mengzhe, Liu, Xunying

We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision s

Externí odkaz: http://arxiv.org/abs/2406.10160

Zobrazit plný text záznamu

Report

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

Autor: Li, Guinan, Deng, Jiajun, Chen, Youjun, Geng, Mengzhe, Hu, Shujie, Li, Zhe, Jin, Zengrui, Wang, Tianzi, Xie, Xurong, Meng, Helen, Liu, Xunying

This paper proposes joint speaker feature learning methods for zero-shot adaptation of audio-visual multichannel speech separation and recognition systems. xVector and ECAPA-TDNN speaker encoders are connected using purpose-built fusion blocks and ti

Externí odkaz: http://arxiv.org/abs/2406.10152

Zobrazit plný text záznamu

Report

Towards Effective and Efficient Non-autoregressive Decoding Using Block-based Attention Mask

Autor: Wang, Tianzi, Xie, Xurong, Li, Zhaoqing, Hu, Shoukang, Jing, Zengrui, Deng, Jiajun, Cui, Mingyu, Hu, Shujie, Geng, Mengzhe, Li, Guinan, Meng, Helen, Liu, Xunying

This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output l

Externí odkaz: http://arxiv.org/abs/2406.10034

Zobrazit plný text záznamu

Report

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Autor: Hu, Shujie, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Hao, Hongkun, Pan, Jing, Liu, Xunying, Li, Jinyu, Sivasankaran, Sunit, Liu, Linquan, Wei, Furu

The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilitie

Externí odkaz: http://arxiv.org/abs/2404.00656

Zobrazit plný text záznamu

Report

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Autor: Wang, Huimeng, Jin, Zengrui, Geng, Mengzhe, Hu, Shujie, Li, Guinan, Wang, Tianzi, Xu, Haoning, Liu, Xunying

Automatic recognition of dysarthric speech remains a highly challenging task to date. Neuro-motor conditions and co-occurring physical disabilities create difficulty in large-scale data collection for ASR system development. Adapting SSL pre-trained

Externí odkaz: http://arxiv.org/abs/2401.00662

Zobrazit plný text záznamu

Report

Boosting Large Language Model for Speech Synthesis: An Empirical Study

Autor: Hao, Hongkun, Zhou, Long, Liu, Shujie, Li, Jinyu, Hu, Shujie, Wang, Rui, Wei, Furu

Large language models (LLMs) have made significant advancements in natural language processing and are concurrently extending the language ability to other modalities, such as speech and vision. Nevertheless, most of the previous work focuses on prom

Externí odkaz: http://arxiv.org/abs/2401.00246

Zobrazit plný text záznamu

Report

Towards Automatic Data Augmentation for Disordered Speech Recognition

Autor: Jin, Zengrui, Xie, Xurong, Wang, Tianzi, Geng, Mengzhe, Deng, Jiajun, Li, Guinan, Hu, Shujie, Liu, Xunying

Automatic recognition of disordered speech remains a highly challenging task to date due to data scarcity. This paper presents a reinforcement learning (RL) based on-the-fly data augmentation approach for training state-of-the-art PyChain TDNN and en

Externí odkaz: http://arxiv.org/abs/2312.08641

Zobrazit plný text záznamu

Report

Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation and Recognition

Autor: Li, Guinan, Deng, Jiajun, Geng, Mengzhe, Jin, Zengrui, Wang, Tianzi, Hu, Shujie, Cui, Mingyu, Meng, Helen, Liu, Xunying

Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-chan

Externí odkaz: http://arxiv.org/abs/2307.02909

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání