Showing 1 - 10 of 593 for search: '"Xie, Fenglong"'
The neural codec language model (CLM) has demonstrated remarkable performance in text-to-speech (TTS) synthesis. However, troubled by "recency bias", CLM lacks sufficient attention to coarse-grained information at a higher temporal scale, often prod
External link:
http://arxiv.org/abs/2409.11630
The long speech sequence has been troubling language models (LM) based TTS approaches in terms of modeling complexity and efficiency. This work proposes SoCodec, a semantic-ordered multi-stream speech codec, to address this issue. It compresses speec
External link:
http://arxiv.org/abs/2409.00933
VQ-VAE, as a mainstream approach of speech tokenizer, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes product-quantized (PQ) VAE with more codebooks but fewer codewo
External link:
http://arxiv.org/abs/2406.02940
This paper proposes a novel semi-supervised TTS framework, QS-TTS, to improve TTS quality with lower supervised data requirements via Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL) utilizing more unlabeled speech audio. Thi
External link:
http://arxiv.org/abs/2309.00126
This paper aims to enhance low-resource TTS by reducing training data requirements using compact speech representations. A Multi-Stage Multi-Codebook (MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to waveforms. Subsequentl
External link:
http://arxiv.org/abs/2210.15131
We propose a Multi-Stage, Multi-Codebook (MSMC) approach to high-performance neural TTS synthesis. A vector-quantized, variational autoencoder (VQ-VAE) based feature analyzer is used to encode Mel spectrograms of speech training data by down-sampling
External link:
http://arxiv.org/abs/2209.10887
This paper presents Nana-HDR, a new non-attentive non-autoregressive model with hybrid Transformer-based Dense-fuse encoder and RNN-based decoder for TTS. It mainly consists of three parts: Firstly, a novel Dense-fuse encoder with dense connections b
External link:
http://arxiv.org/abs/2109.13673
Authors:
Stradford, Laura, Curtis, Jeffrey R., Zueger, Patrick, Xie, Fenglong, Curtis, David, Gavigan, Kelly, Clinton, Cassie, Venkatachalam, Shilpa, Rivera, Esteban, Nowell, W. Benjamin
Published in:
In Contemporary Clinical Trials Communications April 2024 38
Published in:
In Applied Acoustics 15 March 2024 218
In this work, a robust and efficient text-to-speech (TTS) synthesis system named Triple M is proposed for large-scale online application. The key components of Triple M are: 1) A sequence-to-sequence model adopts a novel multi-guidance attention to t
External link:
http://arxiv.org/abs/2102.00247