Enhancing audio quality for expressive Neural Text-to-Speech
Author: Adam Gabrys, Kamil Pokora, Daniel Saez-Trigueros, Viacheslav Klimkov, Jaime Lorenzo-Trueba, Jakub Lachowicz, Abdelhamid Ezzerg, Bartosz Putrycz, Daniel Korzekwa, David McHardy
Language: English
Year of publication: 2021
Subject: Computer science; Computer Science - Artificial Intelligence; Speech recognition; Acoustic model; Speech synthesis; MUSHRA; Naturalness; Autoregressive model; Sound quality; Electrical Engineering and Systems Science - Audio and Speech Processing
Description: Artificial speech synthesis has made a great leap in terms of naturalness, as recent Text-to-Speech (TTS) systems are capable of producing speech with quality similar to human recordings. However, not all speaking styles are easy to model: highly expressive voices remain challenging even for recent TTS architectures, since there appears to be a trade-off between the expressiveness of the generated audio and its signal quality. In this paper, we present a set of techniques that can be leveraged to enhance the signal quality of a highly expressive voice without the use of additional data. The proposed techniques include: tuning the granularity of the autoregressive loop during training; using Generative Adversarial Networks in acoustic modelling; and using Variational Auto-Encoders in both the acoustic model and the neural vocoder (an illustrative sketch of the VAE component follows this record). We show that, when combined, these techniques narrow the gap in perceived naturalness between the baseline system and recordings by 39% in terms of MUSHRA scores for an expressive celebrity voice. Comment: 6 pages, 4 figures, 2 tables, SSW 2021
Database: OpenAIRE
External link:
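The abstract names its techniques without implementation detail. As a purely illustrative, hedged sketch of one of them, the use of a Variational Auto-Encoder inside the acoustic model, the snippet below shows a reference encoder that compresses a mel-spectrogram into a latent conditioning vector with a KL regulariser. The module name `ReferenceVAE`, the GRU-based encoder, and all sizes (`n_mels=80`, `latent_dim=32`) are assumptions for illustration, not the authors' architecture.

```python
# Minimal sketch (not the paper's implementation) of a VAE-style reference
# encoder that could condition an acoustic model on a latent style vector.
import torch
import torch.nn as nn

class ReferenceVAE(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, latent_dim: int = 32):
        super().__init__()
        # A GRU summarises the variable-length mel-spectrogram into one vector.
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)          # h: (1, batch, hidden)
        h = h.squeeze(0)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterisation trick: sample the latent while keeping gradients.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # KL divergence to a standard normal prior, averaged over the batch.
        kl = -0.5 * torch.mean(
            torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        )
        return z, kl

# Example: the latent z would typically be broadcast over the decoder steps of
# the acoustic model, and the KL term added (often with an annealed weight) to
# the spectrogram reconstruction loss.
vae = ReferenceVAE()
mel = torch.randn(4, 200, 80)  # dummy batch of mel-spectrograms
z, kl = vae(mel)
```

In this family of approaches, the latent bottleneck lets the model capture expressive variation that the text alone does not determine, while the KL penalty keeps the latent space well behaved for sampling at synthesis time; how the paper itself applies VAEs to the acoustic model and the vocoder is described in the full text.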