Showing 1 - 10 of 232 for search: '"Cui, Mingyu"'
With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate seamlessly with text-based tokens. Compared to most studies that focus on continuous speech features…
External link:
http://arxiv.org/abs/2411.08742
Grapheme-to-phoneme (G2P) conversion is a crucial step in Text-to-Speech (TTS) systems, responsible for mapping graphemes to their corresponding phonetic representations. However, it faces ambiguity problems where the same grapheme can represent multiple…
External link:
http://arxiv.org/abs/2411.07563
Author:
Kang, Jiawen, Meng, Lingwei, Cui, Mingyu, Wang, Yuejiao, Wu, Xixin, Liu, Xunying, Meng, Helen
Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement…
External link:
http://arxiv.org/abs/2409.12388
Author:
Cui, Mingyu, Tan, Daxin, Yang, Yifan, Wang, Dingdong, Wang, Huimeng, Chen, Xiao, Chen, Xie, Liu, Xunying
With the advancement of Self-supervised Learning (SSL) in speech-related tasks, there has been growing interest in utilizing discrete tokens generated by SSL for automatic speech recognition (ASR), as they offer faster processing techniques. However,…
External link:
http://arxiv.org/abs/2409.08805
Author:
Cui, Mingyu, Yang, Yifan, Deng, Jiajun, Kang, Jiawen, Hu, Shujie, Wang, Tianzi, Li, Zhaoqing, Zhang, Shiliang, Chen, Xie, Liu, Xunying
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable. In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in…
External link:
http://arxiv.org/abs/2409.08797
Author:
Hu, Shujie, Xie, Xurong, Geng, Mengzhe, Jin, Zengrui, Deng, Jiajun, Li, Guinan, Wang, Yi, Cui, Mingyu, Wang, Tianzi, Meng, Helen, Liu, Xunying
Self-supervised learning (SSL) based speech foundation models have been applied to a wide range of ASR tasks. However, their application to dysarthric and elderly speech via data-intensive parameter fine-tuning is confronted by in-domain data scarcity…
External link:
http://arxiv.org/abs/2407.13782
Author:
Yang, Yifan, Song, Zheshu, Zhuo, Jianheng, Cui, Mingyu, Li, Jinpeng, Yang, Bo, Du, Yexing, Ma, Ziyang, Liu, Xunying, Wang, Ziyuan, Li, Ke, Fan, Shuai, Yu, Kai, Zhang, Wei-Qiang, Chen, Guoguo, Chen, Xie
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech…
External link:
http://arxiv.org/abs/2406.11546
Author:
Li, Zhaoqing, Xu, Haoning, Wang, Tianzi, Hu, Shoukang, Jin, Zengrui, Hu, Shujie, Deng, Jiajun, Cui, Mingyu, Geng, Mengzhe, Liu, Xunying
We propose a novel one-pass multiple ASR systems joint compression and quantization approach using an all-in-one neural model. A single compression cycle allows multiple nested systems with varying Encoder depths, widths, and quantization precision settings…
External link:
http://arxiv.org/abs/2406.10160
Author:
Wang, Tianzi, Xie, Xurong, Li, Zhaoqing, Hu, Shoukang, Jin, Zengrui, Deng, Jiajun, Cui, Mingyu, Hu, Shujie, Geng, Mengzhe, Li, Guinan, Meng, Helen, Liu, Xunying
This paper proposes a novel non-autoregressive (NAR) block-based Attention Mask Decoder (AMD) that flexibly balances performance-efficiency trade-offs for Conformer ASR systems. AMD performs parallel NAR inference within contiguous blocks of output labels…
External link:
http://arxiv.org/abs/2406.10034
End-to-end multi-talker speech recognition has garnered great interest as an effective approach to directly transcribe overlapped speech from multiple speakers. Current methods typically adopt either 1) single-input multiple-output (SIMO) models with…
External link:
http://arxiv.org/abs/2401.04152