What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS

Authors:	Laurent Girin, Brooke Stephenson, Thomas Hueber, Laurent Besacier
Contributors:	Groupe d’Étude en Traduction Automatique/Traitement Automatisé des Langues et de la Parole (GETALP), Laboratoire d'Informatique de Grenoble (LIG), Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP), Université Grenoble Alpes (UGA), GIPSA - Cognitive Robotics, Interactive Systems, & Speech Processing (GIPSA-CRISSP), GIPSA Pôle Parole et Cognition (GIPSA-PPC), Grenoble Images Parole Signal Automatique (GIPSA-lab), Université Grenoble Alpes (UGA)-Grenoble Images Parole Signal Automatique (GIPSA-lab), Institut Universitaire de France (IUF), Ministère de l'Education nationale, de l’Enseignement supérieur et de la Recherche (M.E.N.E.S.R.), ANR-19-P3IA-0003, MIAI, MIAI @ Grenoble Alpes (2019)
Language:	English
Year of publication:	2020
Subject:	Computation and Language (cs.CL); Audio and Speech Processing (eess.AS); incremental speech synthesis; speech synthesis; speech recognition; representation learning; feature learning; deep neural networks; encoder; MUSHRA; context (language use); contrast (statistics); sentence; computer science; networking & telecommunications; electrical engineering, electronic engineering, information engineering
Source:	Interspeech 2020 - 21st Annual Conference of the International Speech Communication Association, Oct 2020, Shanghai (Virtual Conf), China
Description:	In incremental text-to-speech synthesis (iTTS), the synthesizer produces audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e., when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full-context representation with a one-word lookahead and 94% with a two-word lookahead. We then use a random forest analysis to investigate which text features most influence the evolution towards the final representation. The results show that the most salient factors are related to token length. Finally, we evaluate the effects of the lookahead k at the decoder level using a MUSHRA listening test. This test shows results that contrast with the high figures above: speech synthesis quality obtained with a two-word lookahead is significantly lower than that obtained with the full sentence. Comment: 5 pages, 4 figures
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::38b521a6c4a64b1a6a30beb3e6eb777d https://hal.archives-ouvertes.fr/hal-02962234/file/What_the_future_brings___Interspeech-4.pdf