Autor: |
Weizhao Zhang, Hongwu Yang |
Jazyk: |
angličtina |
Rok vydání: |
2022 |
Předmět: |
|
Zdroj: |
Applied Sciences, Vol 12, Iss 23, p 12185 (2022) |
Druh dokumentu: |
article |
ISSN: |
2076-3417 |
DOI: |
10.3390/app122312185 |
Popis: |
The paper proposes a meta-learning-based Mandarin-Tibetan cross-lingual text-to-speech (TTS) to realize both Mandarin and Tibetan speech synthesis under a unique framework. First, we build two kinds of Tacotron2-based Mandarin-Tibetan cross-lingual baseline TTS. One is a shared encoder Mandarin-Tibetan cross-lingual TTS, and another is a separate encoder Mandarin-Tibetan cross-lingual TTS. Both baseline TTS use the speaker classifier with a gradient reversal layer to disentangle speaker-specific information from the text encoder. At the same time, we design a prosody generator to extract prosodic information from sentences to explore syntactic and semantic information adequately. To further improve the synthesized speech quality of the Tacotron2-based Mandarin-Tibetan cross-lingual TTS, we propose a meta-learning-based Mandarin-Tibetan cross-lingual TTS. Based on the separate encoder Mandarin-Tibetan cross-lingual TTS, we use an additional dynamic network to predict the parameters of the language-dependent text encoder that could realize better cross-lingual knowledge sharing in the sequence-to-sequence TTS. Lastly, we synthesize Mandarin or Tibetan speech through the unique acoustic model. The baseline experimental results show that the separate encoder Mandarin-Tibetan cross-lingual TTS could handle the input of different languages better than the shared encoder Mandarin-Tibetan cross-lingual TTS. The experimental results further show that the proposed meta-learning-based Mandarin-Tibetan cross-lingual speech synthesis method could effectively improve the voice quality of synthesized speech in terms of naturalness and speaker similarity. |
Databáze: |
Directory of Open Access Journals |
Externí odkaz: |
|