Popis: |
Recent state-of-the-art neural text-to-speech synthesis models have significantly improved the quality of synthesized speech. However, several problems remain with previous methods: while autoregressive models suffer from slow inference, non-autoregressive models usually have a complicated, time- and memory-consuming training pipeline. This paper proposes a novel model called FastTacotron, an improved text-to-speech method based on ForwardTacotron. The proposed model keeps the recurrent Tacotron architecture but replaces its autoregressive attentive part with a single forward pass to accelerate inference. In particular, the attention mechanism in Tacotron is replaced with a length regulator, as in FastSpeech, to enable parallel mel-spectrogram generation. Moreover, we introduce additional prosodic information (e.g., pitch, energy, and more accurate duration) as conditional inputs to make the duration predictor more accurate. Experiments show that our model matches state-of-the-art models in speech quality and inference speed, nearly eliminates word skipping and repeating in particularly hard cases, and makes it possible to control the speed and pitch of the generated utterance. More importantly, our model converges in just a few hours of training, up to 11.2 times faster than existing methods. Furthermore, the memory requirement of our model grows linearly with sequence length, which makes it possible to synthesize complete articles in a single pass. Audio samples can be found at https://bit.ly/3xguaCW.