Showing 1 - 10 of 112 for the search: '"Yang, Dongchao"'
Author:
Yang, Dongchao, Guo, Haohan, Wang, Yuanyuan, Huang, Rongjie, Li, Xiang, Tan, Xu, Wu, Xixin, Meng, Helen
Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, emp…
External link:
http://arxiv.org/abs/2406.10056
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec lan…
External link:
http://arxiv.org/abs/2406.08336
VQ-VAE, as a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes a product-quantized (PQ) VAE with more codebooks but fewer codewo…
External link:
http://arxiv.org/abs/2406.02940
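The PQ-VAE entry above mentions product quantization only in outline. As a hedged illustration (a generic sketch of product quantization, not the paper's implementation; all sizes and names are illustrative), splitting each latent vector into groups, each quantized by its own small codebook, gives an effective codebook size equal to the product of the group sizes while storing far fewer codewords:

```python
import numpy as np

# Sketch of product quantization (PQ): instead of one huge codebook,
# split each latent vector into G groups and quantize each group with
# its own small codebook of K codewords. The effective codebook size is
# K**G (here 64**4 ≈ 16.7M) while only G * K codewords are stored.
rng = np.random.default_rng(0)

G, K, D = 4, 64, 32                      # groups, codewords per group, dims per group
codebooks = rng.normal(size=(G, K, D))   # G small codebooks

def pq_quantize(z):
    """Quantize a latent vector z of shape (G*D,) group by group."""
    z = z.reshape(G, D)
    # squared distance from each group to every codeword in its codebook
    dists = ((z[:, None, :] - codebooks) ** 2).sum(-1)   # (G, K)
    idx = dists.argmin(-1)                               # (G,) nearest indices
    zq = codebooks[np.arange(G), idx]                    # (G, D) quantized groups
    return idx, zq.reshape(-1)

idx, zq = pq_quantize(rng.normal(size=G * D))
print(idx.shape, zq.shape)  # (4,) (128,)
```

Because each group's argmin runs over only K codewords, large effective vocabularies stay tractable, and no single codebook is large enough for most of its entries to go unused.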
In this study, we propose a simple and efficient non-autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) it can be trained on a speech-only dataset, without any alignme…
External link:
http://arxiv.org/abs/2406.02328
Author:
Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as…
External link:
http://arxiv.org/abs/2404.03204
Author:
Ju, Zeqian, Wang, Yuancheng, Shen, Kai, Tan, Xu, Xin, Detai, Yang, Dongchao, Liu, Yanqing, Leng, Yichong, Song, Kaitao, Tang, Siliang, Wu, Zhizheng, Qin, Tao, Li, Xiang-Yang, Ye, Wei, Zhang, Shikun, Bian, Jiang, He, Lei, Li, Jinyu, Zhao, Sheng
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre,…
External link:
http://arxiv.org/abs/2403.03100
Author:
Wang, Yuanyuan, Chen, Hangting, Yang, Dongchao, Yu, Jianwei, Weng, Chao, Wu, Zhiyong, Meng, Helen
Query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embeddings. In this way, separation mod…
External link:
http://arxiv.org/abs/2312.15463
Common target sound extraction (TSE) approaches have primarily relied on discriminative methods to separate the target sound while minimizing interference from unwanted sources, with varying success in separating the target from the backgr…
External link:
http://arxiv.org/abs/2310.04567
Author:
Yang, Dongchao, Tian, Jinchuan, Tan, Xu, Huang, Rongjie, Liu, Songxiang, Chang, Xuankai, Shi, Jiatong, Zhao, Sheng, Bian, Jiang, Wu, Xixin, Zhao, Zhou, Watanabe, Shinji, Meng, Helen
Large language models (LLMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio…
External link:
http://arxiv.org/abs/2310.00704
Author:
Leng, Yichong, Guo, Zhifang, Shen, Kai, Tan, Xu, Ju, Zeqian, Liu, Yanqing, Liu, Yufei, Yang, Dongchao, Zhang, Leying, Song, Kaitao, He, Lei, Li, Xiang-Yang, Zhao, Sheng, Qin, Tao, Bian, Jiang
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using…
External link:
http://arxiv.org/abs/2309.02285