Building HMM-TTS Voices on Diverse Data

Authors:	Kayoko Yanagisawa, Langzhou Chen, Vincent Wan, Norbert Braunschweiler, Masami Akamine, Javier Latorre, Mark J. F. Gales
Publication year:	2014
Subject:	Computer science; Speech recognition; Decision tree; Statistical model; Speech synthesis; Speaker recognition; Machine learning; Signal Processing; The Internet; Quality (business); Artificial intelligence; Electrical and Electronic Engineering; Cluster analysis; Hidden Markov model
Source:	IEEE Journal of Selected Topics in Signal Processing. 8:296-306
ISSN:	1941-0484 1932-4553
DOI:	10.1109/jstsp.2013.2295058
Description:	The statistical models of hidden Markov model-based text-to-speech (HMM-TTS) systems are typically built using homogeneous data. Data can be acquired from many different sources, but combining them yields a non-homogeneous, or diverse, dataset. This paper describes the application of average voice models (AVMs) and a novel application of cluster adaptive training (CAT) with multiple context-dependent decision trees to create HMM-TTS voices using diverse data: speech recorded in studios mixed with speech obtained from the internet. Training AVM and CAT models on diverse data yields better-quality speech than training on high-quality studio data alone. Tests show that CAT can create a voice for a target speaker with as little as 7 seconds of adaptation data; an AVM needs more data to reach the same level of similarity to the target speaker. Tests also show that CAT produces higher-quality voices than AVMs irrespective of the amount of adaptation data. Lastly, modeling the data with multiple context-clustering decision trees is shown to be beneficial.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::1fb6777263263ad23cf96c9b4043c4dc https://doi.org/10.1109/jstsp.2013.2295058