Výsledky vyhledávání

Report

Autoregressive Speech Synthesis without Vector Quantization

Autor: Meng, Lingwei, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Han, Bing, Hu, Shujie, Liu, Yanqing, Li, Jinyu, Zhao, Sheng, Wu, Xixin, Meng, Helen, Wei, Furu

We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector qua

Externí odkaz: http://arxiv.org/abs/2407.08551

Zobrazit plný text záznamu

Report

VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment

Autor: Han, Bing, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Meng, Lingwei, Qian, Yanming, Liu, Yanqing, Zhao, Sheng, Li, Jinyu, Wei, Furu

With the help of discrete neural audio codecs, large language models (LLM) have increasingly been recognized as a promising methodology for zero-shot Text-to-Speech (TTS) synthesis. However, sampling based decoding strategies bring astonishing divers

Externí odkaz: http://arxiv.org/abs/2406.07855

Zobrazit plný text záznamu

Report

Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation

Autor: Wang, Peidong, Xue, Jian, Li, Jinyu, Chen, Junkun, Subramanian, Aswin Shanmugam

Language-agnostic many-to-one end-to-end speech translation models can convert audio signals from different source languages into text in a target language. These models do not need source language identification, which improves user experience. In s

Externí odkaz: http://arxiv.org/abs/2406.10276

Zobrazit plný text záznamu

Report

An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

Autor: Wang, Xiaofei, Eskimez, Sefik Emre, Thakker, Manthan, Yang, Hemin, Zhu, Zirun, Tang, Min, Xia, Yufei, Li, Jinzhu, Zhao, Sheng, Li, Jinyu, Kanda, Naoyuki

Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt conta

Externí odkaz: http://arxiv.org/abs/2406.05699

Zobrazit plný text záznamu

Report

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Autor: Chen, Sanyuan, Liu, Shujie, Zhou, Long, Liu, Yanqing, Tan, Xu, Li, Jinyu, Zhao, Sheng, Qian, Yao, Wei, Furu

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration

Externí odkaz: http://arxiv.org/abs/2406.05370

Zobrazit plný text záznamu

Report

Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

Autor: Eskimez, Sefik Emre, Wang, Xiaofei, Thakker, Manthan, Tsai, Chung-Hsien, Li, Canrun, Xiao, Zhen, Yang, Hemin, Zhu, Zirun, Tang, Min, Li, Jinyu, Zhao, Sheng, Kanda, Naoyuki

Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker

Externí odkaz: http://arxiv.org/abs/2406.04281

Zobrazit plný text záznamu

Report

TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Autor: Le, Chenyang, Qian, Yao, Wang, Dongmei, Zhou, Long, Liu, Shujie, Wang, Xiaofei, Yousefi, Midia, Qian, Yanmin, Li, Jinyu, Zhao, Sheng, Zeng, Michael

There is a rising interest and trend in research towards directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., a pipeli

Externí odkaz: http://arxiv.org/abs/2405.17809

Zobrazit plný text záznamu

Report

CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations

Autor: Zhang, Leying, Qian, Yao, Zhou, Long, Liu, Shujie, Wang, Dongmei, Wang, Xiaofei, Yousefi, Midia, Qian, Yanmin, Li, Jinyu, He, Lei, Zhao, Sheng, Zeng, Michael

Recent advancements in zero-shot text-to-speech (TTS) modeling have led to significant strides in generating high-fidelity and diverse speech. However, dialogue generation, along with achieving human-like naturalness in speech, continues to be a chal

Externí odkaz: http://arxiv.org/abs/2404.06690

Zobrazit plný text záznamu

Report

RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Autor: Xin, Detai, Tan, Xu, Shen, Kai, Ju, Zeqian, Yang, Dongchao, Wang, Yuancheng, Takamichi, Shinnosuke, Saruwatari, Hiroshi, Liu, Shujie, Li, Jinyu, Zhao, Sheng

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as

Externí odkaz: http://arxiv.org/abs/2404.03204

Zobrazit plný text záznamu

Report

WavLLM: Towards Robust and Adaptive Speech Large Language Model

Autor: Hu, Shujie, Zhou, Long, Liu, Shujie, Chen, Sanyuan, Hao, Hongkun, Pan, Jing, Liu, Xunying, Li, Jinyu, Sivasankaran, Sunit, Liu, Linquan, Wei, Furu

The recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, progressively broadening their scope to multimodal perception and generation. However, effectively integrating listening capabilitie

Externí odkaz: http://arxiv.org/abs/2404.00656

Zobrazit plný text záznamu

Vyhledávací nástroje:

Upřesnit hledání