Showing 1 - 10 of 112 for the search: '"Yang, Dongchao"'
Author:
Yang, Dongchao, Guo, Haohan, Wang, Yuanyuan, Huang, Rongjie, Li, Xiang, Tan, Xu, Wu, Xixin, Meng, Helen
Large language models (LLMs) have demonstrated strong capabilities in text understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach, emp…
External link:
http://arxiv.org/abs/2406.10056
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec lan…
External link:
http://arxiv.org/abs/2406.08336
VQ-VAE, as a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes a product-quantized (PQ) VAE with more codebooks but fewer codewo…
External link:
http://arxiv.org/abs/2406.02940
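The PQ-VAE entry above mentions product quantization only in outline. As a hedged illustration (a generic sketch of product quantization, not the paper's implementation; all sizes and names are illustrative), splitting each latent vector into groups, each quantized by its own small codebook, gives an effective codebook size equal to the product of the group sizes while storing far fewer codewords:

```python
import numpy as np

# Sketch of product quantization (PQ): instead of one huge codebook,
# split each latent vector into G groups and quantize each group with
# its own small codebook of K codewords. The effective codebook size is
# K**G (here 64**4 ≈ 16.7M) while only G * K codewords are stored.
rng = np.random.default_rng(0)

G, K, D = 4, 64, 32                      # groups, codewords per group, dims per group
codebooks = rng.normal(size=(G, K, D))   # G small codebooks

def pq_quantize(z):
    """Quantize a latent vector z of shape (G*D,) group by group."""
    z = z.reshape(G, D)
    # squared distance from each group to every codeword in its codebook
    dists = ((z[:, None, :] - codebooks) ** 2).sum(-1)   # (G, K)
    idx = dists.argmin(-1)                               # (G,) nearest indices
    zq = codebooks[np.arange(G), idx]                    # (G, D) quantized groups
    return idx, zq.reshape(-1)

idx, zq = pq_quantize(rng.normal(size=G * D))
print(idx.shape, zq.shape)  # (4,) (128,)
```

Because each group's argmin runs over only K codewords, large effective vocabularies stay tractable, and no single codebook is large enough for most of its entries to go unused.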
In this study, we propose a simple and efficient non-autoregressive (NAR) text-to-speech (TTS) system based on diffusion, named SimpleSpeech. Its simplicity shows in three aspects: (1) it can be trained on a speech-only dataset, without any alignme…
External link:
http://arxiv.org/abs/2406.02328
Author:
Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng
We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as…
External link:
http://arxiv.org/abs/2404.03204
Author:
Ju, Zeqian, Wang, Yuancheng, Shen, Kai, Tan, Xu, Xin, Detai, Yang, Dongchao, Liu, Yanqing, Leng, Yichong, Song, Kaitao, Tang, Siliang, Wu, Zhizheng, Qin, Tao, Li, Xiang-Yang, Ye, Wei, Zhang, Shikun, Bian, Jiang, He, Lei, Li, Jinyu, Zhao, Sheng
While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre,…
External link:
http://arxiv.org/abs/2403.03100
Author:
Wang, Yuanyuan, Chen, Hangting, Yang, Dongchao, Yu, Jianwei, Weng, Chao, Wu, Zhiyong, Meng, Helen
Query-based audio separation usually employs specific queries to extract target sources from a mixture of audio signals. Currently, most query-based separation models need additional networks to obtain query embeddings. In this way, separation mod…
External link:
http://arxiv.org/abs/2312.15463
Common target sound extraction (TSE) approaches have primarily relied on discriminative methods to separate the target sound while minimizing interference from unwanted sources, with varying success in separating the target from the backgr…
External link:
http://arxiv.org/abs/2310.04567
Author:
Yang, Dongchao, Tian, Jinchuan, Tan, Xu, Huang, Rongjie, Liu, Songxiang, Chang, Xuankai, Shi, Jiatong, Zhao, Sheng, Bian, Jiang, Wu, Xixin, Zhao, Zhou, Watanabe, Shinji, Meng, Helen
Large language models (LLMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio…
External link:
http://arxiv.org/abs/2310.00704
Author:
Leng, Yichong, Guo, Zhifang, Shen, Kai, Tan, Xu, Ju, Zeqian, Liu, Yanqing, Liu, Yufei, Yang, Dongchao, Zhang, Leying, Song, Kaitao, He, Lei, Li, Xiang-Yang, Zhao, Sheng, Qin, Tao, Bian, Jiang
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using…
External link:
http://arxiv.org/abs/2309.02285