Investigations on speaker adaptation using a continuous vocoder within recurrent neural network based text-to-speech synthesis
Authors: Tamás Gábor Csapó, Mohammed Salah Al-Radhi, Ali Raheem Mandeel
Year of publication: 2022
Source: Multimedia Tools and Applications, 82:15635-15649
ISSN: 1573-7721, 1380-7501
DOI: 10.1007/s11042-022-14005-5
Description: This paper presents an investigation of speaker adaptation using a continuous vocoder for parametric text-to-speech (TTS) synthesis. For applications that demand low computational complexity, conventional vocoder-based statistical parametric speech synthesis can be preferable. While capable of remarkable naturalness, recent neural vocoders nonetheless fail to meet the requirements of real-time synthesis. We investigate our earlier continuous vocoder, in which the excitation is characterized by two one-dimensional parameters: Maximum Voiced Frequency and continuous fundamental frequency (F0). We show that an average voice can be trained for deep neural network-based TTS using data from nine English speakers. We performed speaker adaptation experiments for each target speaker with 400 utterances (approximately 14 minutes). By employing recurrent neural network topologies, we achieved a clear improvement in the quality and naturalness of the synthesized speech compared to our previous work. According to the objective measures (Mel-Cepstral Distortion and F0 correlation), the quality of speaker adaptation using the continuous vocoder-based DNN-TTS is slightly better than the WORLD vocoder-based baseline. The subjective MUSHRA-like test results also showed that our speaker adaptation technique is almost as natural as the WORLD vocoder when using Gated Recurrent Unit and Long Short-Term Memory networks. The proposed vocoder, being capable of real-time synthesis, can be used for applications that require fast synthesis.
Database: OpenAIRE
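The two objective measures named in the abstract, Mel-Cepstral Distortion (MCD) and F0 correlation, follow standard definitions. As a minimal illustrative sketch (the function names, the exclusion of the 0th cepstral coefficient, and the voiced-frame masking convention are assumptions of this sketch, not details taken from the paper):

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Mean frame-wise MCD in dB between two time-aligned mel-cepstral
    sequences of shape (frames, coefficients). The 0th (energy)
    coefficient is conventionally excluded."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def f0_correlation(f0_ref, f0_syn):
    """Pearson correlation of F0 over frames voiced in both signals
    (unvoiced frames assumed to be marked with F0 = 0)."""
    voiced = (f0_ref > 0) & (f0_syn > 0)
    return float(np.corrcoef(f0_ref[voiced], f0_syn[voiced])[0, 1])
```

Lower MCD and higher F0 correlation indicate that the adapted voice is closer to the target speaker's natural recordings.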