Showing 1 - 10 of 483 for search: '"Dang, Jianwu"'
Author:
Wang, Tianrui, Li, Jin, Ma, Ziyang, Cao, Rui, Chen, Xie, Wang, Longbiao, Ge, Meng, Wang, Xiaobao, Wang, Yuguang, Dang, Jianwu, Tashi, Nyima
Self-supervised learning (SSL) has garnered significant attention in speech processing, excelling in linguistic tasks such as speech recognition. However, jointly improving the performance of pre-trained models on various downstream tasks, each requi
External link:
http://arxiv.org/abs/2409.00387
Author:
Qiang, Chunyu, Geng, Wang, Zhao, Yi, Fu, Ruibo, Wang, Tao, Gong, Cheng, Wang, Tianrui, Liu, Qiuyu, Yi, Jiangyan, Wen, Zhengqi, Zhang, Chen, Che, Hao, Wang, Longbiao, Dang, Jianwu, Tao, Jianhua
Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) se
External link:
http://arxiv.org/abs/2408.05758
Accurately finding the wrong words in the automatic speech recognition (ASR) hypothesis and recovering them well-founded is the goal of speech error correction. In this paper, we propose a non-autoregressive speech error correction method. A Confiden
External link:
http://arxiv.org/abs/2407.12817
Author:
Gong, Cheng, Cooper, Erica, Wang, Xin, Qiang, Chunyu, Geng, Mengzhe, Wells, Dan, Wang, Longbiao, Dang, Jianwu, Tessier, Marc, Pine, Aidan, Richmond, Korin, Yamagishi, Junichi
Self-supervised learning (SSL) representations from massively multilingual models offer a promising solution for low-resource language speech tasks. Despite advancements, language adaptation in TTS systems remains an open problem. This paper explores
External link:
http://arxiv.org/abs/2406.08911
Emotion Recognition in Conversations (ERC) is a popular task in natural language processing, which aims to recognize the emotional state of the speaker in conversations. While current research primarily emphasizes contextual modeling, there exists a
External link:
http://arxiv.org/abs/2407.00743
Author:
Gong, Cheng, Wang, Xin, Cooper, Erica, Wells, Dan, Wang, Longbiao, Dang, Jianwu, Richmond, Korin, Yamagishi, Junichi
Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. Multilingual TTS systems are limited to resource-rich languages due to the lack of large paired text and studio-quality audio data. TT
External link:
http://arxiv.org/abs/2312.14398
Supervised speech enhancement has gained significantly from recent advancements in neural networks, especially due to their ability to non-linearly fit the diverse representations of target speech, such as waveform or spectrum. However, these direct-
External link:
http://arxiv.org/abs/2312.11201
In this paper, we study the mistake bound of online kernel learning on a budget. We propose a new budgeted online kernel learning model, called Ahpatron, which significantly improves the mistake bound of previous work and resolves the open problem po
External link:
http://arxiv.org/abs/2312.07032
Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (se
External link:
http://arxiv.org/abs/2309.15512
For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representations extracted from speech should serve as a "bridge" betw
External link:
http://arxiv.org/abs/2309.00424