Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech

Authors:	Marek Strelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Lajszczak, Trevor Wood
Language:	English
Publication year:	2021
Subject:	FOS: Computer and information sciences; Sound (cs.SD); Computer Science - Machine Learning; Audio and Speech Processing (eess.AS); FOS: Electrical engineering, electronic engineering, information engineering; Computer Science - Sound; Machine Learning (cs.LG); Electrical Engineering and Systems Science - Audio and Speech Processing
Description:	We present a Split Vector Quantized Variational Autoencoder (SVQ-VAE) architecture using a split vector quantizer for NTTS, as an enhancement to the well-known Variational Autoencoder (VAE) and Vector Quantized Variational Autoencoder (VQ-VAE) architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck, while keeping significant representation power and a discretized latent space small enough for efficient prediction from text. We train the model on recordings in the expressive task-oriented dialogues domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness over the VAE and VQ-VAE models. Furthermore, we demonstrate that the SVQ-VAE latent acoustic space is predictable from text, reducing the gap between the standard constant vector synthesis and vocoded recordings by 32%. Comment: 5 pages, 5 figures, accepted at IberSPEECH 2022.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::f2b8d1684f946b52c1dbeb86efdd84e2 http://arxiv.org/abs/2110.12539