Meta-Learning for Mandarin-Tibetan Cross-Lingual Speech Synthesis

Author:	Weizhao Zhang, Hongwu Yang
Language:	English
Publication year:	2022
Subject:	meta-learning; Mandarin-Tibetan cross-lingual speech synthesis; Tibetan speech synthesis; Technology; Engineering (General). Civil engineering (General); TA1-2040; Biology (General); QH301-705.5; Physics; QC1-999; Chemistry; QD1-999
Source:	Applied Sciences, Vol 12, Iss 23, p 12185 (2022)
Document type:	article
ISSN:	2076-3417
DOI:	10.3390/app122312185
Description:	This paper proposes a meta-learning-based Mandarin-Tibetan cross-lingual text-to-speech (TTS) system that synthesizes both Mandarin and Tibetan speech within a single framework. First, we build two Tacotron2-based Mandarin-Tibetan cross-lingual baseline TTS systems: a shared-encoder system and a separate-encoder system. Both baselines use a speaker classifier with a gradient reversal layer to disentangle speaker-specific information from the text encoder. We also design a prosody generator that extracts prosodic information from sentences in order to exploit their syntactic and semantic information. To further improve the quality of the speech synthesized by the Tacotron2-based baselines, we propose a meta-learning-based Mandarin-Tibetan cross-lingual TTS system. Building on the separate-encoder baseline, it uses an additional dynamic network to predict the parameters of the language-dependent text encoder, which can enable better cross-lingual knowledge sharing in the sequence-to-sequence TTS model. Finally, Mandarin or Tibetan speech is synthesized with a single acoustic model. The baseline experiments show that the separate-encoder system handles input in different languages better than the shared-encoder system. Further experiments show that the proposed meta-learning-based method effectively improves the naturalness and speaker similarity of the synthesized speech.
Database:	Directory of Open Access Journals
External link:	https://doaj.org/article/76b8a90a6b5c431b95220a591a67f18e
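
Note on the dynamic network mentioned in the description: the abstract gives no implementation details, so the following is only a minimal sketch of the general idea, assuming the language-dependent encoder weights are produced by a shared linear projection of a per-language embedding. The class name ParameterGenerator, the function encode, the embedding values, and all dimensions are hypothetical and are not taken from the paper.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ParameterGenerator:
    """Hypothetical generator mapping a language embedding to encoder weights."""

    def __init__(self, embed_dim, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim = in_dim
        self.out_dim = out_dim
        # Shared across languages: one projection row per generated weight.
        self.projection = [
            [rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(in_dim * out_dim)
        ]

    def generate(self, language_embedding):
        """Return an out_dim x in_dim weight matrix for the given language."""
        flat = matvec(self.projection, language_embedding)
        return [
            flat[r * self.in_dim:(r + 1) * self.in_dim]
            for r in range(self.out_dim)
        ]


def encode(weights, text_features):
    """Apply one generated linear encoder layer to a single input frame."""
    return matvec(weights, text_features)


# Made-up language embeddings, for illustration only.
LANGUAGE_EMBEDDINGS = {
    "mandarin": [1.0, 0.0, 0.5],
    "tibetan": [0.0, 1.0, 0.5],
}

generator = ParameterGenerator(embed_dim=3, in_dim=4, out_dim=2)
frame = [0.2, -0.1, 0.4, 0.3]  # stand-in for one frame of text features

# The same generator yields a different encoder for each language.
outputs = {
    lang: encode(generator.generate(emb), frame)
    for lang, emb in LANGUAGE_EMBEDDINGS.items()
}
```

In this sketch only the projection matrix is shared between the two languages; each language contributes just its embedding, which is one plausible way cross-lingual knowledge sharing could arise.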