Transfer learning of the expressivity using flow metric learning in multispeaker text-to-speech synthesis
Author: | Ajinkya Kulkarni, Vincent Colotte, Denis Jouvet |
---|---|
Contributors: | Jouvet, Denis; Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est; Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria), Université de Lorraine (UL), Centre National de la Recherche Scientifique (CNRS). Experiments presented in this paper were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr). |
Language: | English |
Year of publication: | 2020 |
Subject: | Computer Science [cs]/Signal and Image Processing; speech synthesis; text-to-speech; expressivity; variational autoencoder; inverse autoregressive flow; deep metric learning; transfer of learning; speaker recognition; acoustic model; speech recognition; latent variable; embedding; inference; mean opinion score |
Source: | INTERSPEECH 2020, Oct 2020, Shanghai / Virtual, China |
Description: | International audience; In this paper, we present a novel deep metric learning architecture, with variational inference, incorporated in a parametric multispeaker expressive text-to-speech (TTS) system. We propose inverse autoregressive flow (IAF) as a way to perform the variational inference, thus providing a flexible approximate posterior distribution. The proposed approach conditions the text-to-speech system on speaker embeddings, so that the latent space represents the emotion as semantic characteristics. For representing the speaker, we extracted speaker embeddings from an x-vector based speaker recognition model trained on speech data from many speakers. To predict the vocoder features, we used an acoustic model conditioned on the textual features as well as on the speaker embedding. We transferred the expressivity by using the mean of the latent variables for each emotion to generate expressive speech in different speakers' voices for which no expressive speech data is available. We compared the results obtained using flow-based variational inference against a variational autoencoder as a baseline model. The performance measured by mean opinion score (MOS), speaker MOS, and expressive MOS shows that N-pair loss based deep metric learning, together with the IAF model, improves the transfer of expressivity to the desired speaker's voice in synthesized speech. |
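The abstract names two ingredients: an IAF step that warps a Gaussian posterior sample through an autoregressive affine transform, and an N-pair loss that pulls same-emotion latents together while pushing other emotions apart. The following NumPy sketch illustrates both in isolation under toy dimensions; all parameter values, shapes, and variable names are illustrative assumptions, not taken from the paper, where the affine parameters would come from an autoregressive network conditioned on the encoder output.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # toy latent dimension (illustrative only)

# Hypothetical parameters for one IAF step; in practice these are produced
# by an autoregressive network, not fixed random matrices.
W_m = rng.normal(size=(D, D))
W_s = rng.normal(size=(D, D)) * 0.1
b_m = np.zeros(D)
b_s = np.zeros(D)

def iaf_step(z, W_m, W_s, b_m, b_s):
    """One inverse autoregressive flow step: z' = sigma(z) * z + mu(z).

    A strictly lower-triangular mask makes mu_i and sigma_i depend only
    on z_<i, so the Jacobian is triangular and its log-determinant is
    simply sum(log sigma), which keeps the density tractable.
    """
    mask = np.tril(np.ones((z.shape[-1], z.shape[-1])), k=-1)
    mu = z @ (W_m * mask).T + b_m
    sigma = np.exp(z @ (W_s * mask).T + b_s)  # exp keeps sigma positive
    z_new = sigma * z + mu
    log_det = np.log(sigma).sum(axis=-1)
    return z_new, log_det

def n_pair_loss(anchor, positive, negatives):
    """N-pair loss on embeddings: log(1 + sum_i exp(a.n_i - a.p))."""
    pos_sim = anchor @ positive          # similarity to same-class sample
    neg_sim = negatives @ anchor         # similarities to other classes
    return np.log1p(np.exp(neg_sim - pos_sim).sum())

# Push a base Gaussian sample through one flow step.
z0 = rng.normal(size=D)
z1, log_det = iaf_step(z0, W_m, W_s, b_m, b_s)

# Toy emotion embeddings: one anchor, one positive, two negatives.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([[-1.0, 0.2], [0.0, -1.0]])
loss = n_pair_loss(a, p, n)
```

Stacking several such steps yields the "flexible approximate posterior" the abstract mentions, and the N-pair loss shapes that latent space so that the per-emotion means used for transfer are well separated.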
Database: | OpenAIRE |
External link: |