High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency

Authors: June Sig Sung, Spyros Raptis, Pirros Tsiakoulis, Nikolaos Ellinas, Aimilios Chalamandaris, Hyoungmin Park, Georgia Maniati, Georgios Vamvoukakis, Panos Kakoulidis, Konstantinos Markopoulos
Year of publication: 2020
Source: INTERSPEECH
DOI: 10.21437/interspeech.2020-2464
Description: This paper presents an end-to-end text-to-speech system with low latency on a CPU, suitable for real-time applications. The system consists of an autoregressive attention-based sequence-to-sequence acoustic model and the LPCNet vocoder for waveform generation. An acoustic model architecture that adopts modules from both Tacotron 1 and 2 is proposed, while stability is ensured by a recently proposed purely location-based attention mechanism, suitable for generating sentences of arbitrary length. During inference, the decoder is unrolled and acoustic feature generation is performed in a streaming manner, yielding a nearly constant latency that is independent of sentence length. Experimental results show that the acoustic model can produce feature sequences with minimal latency, about 31 times faster than real time on a desktop CPU and 6.5 times faster on a mobile CPU, meeting the requirements for real-time applications on both devices. The full end-to-end system generates speech of almost natural quality, as verified by listening tests.
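The streaming inference idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `decoder_step` is a hypothetical stand-in for one unrolled step of the attention-based decoder, and the generator yields each acoustic frame as soon as it is produced, so the vocoder can start synthesizing before the whole sentence is decoded and time-to-first-frame stays constant regardless of sentence length.

```python
# Sketch (assumed interfaces) of streaming autoregressive feature generation:
# the decoder is unrolled one step at a time and frames are emitted
# immediately, instead of being accumulated for the full utterance.

from typing import Iterator, List, Tuple


def decoder_step(state: dict, prev_frame: List[float]) -> Tuple[List[float], bool, dict]:
    """Hypothetical one-step decoder: returns (frame, stop_flag, new_state)."""
    state["t"] += 1
    frame = [float(state["t"])] * 4          # placeholder 4-dim acoustic frame
    return frame, state["t"] >= state["max_steps"], state


def stream_frames(max_steps: int) -> Iterator[List[float]]:
    """Unrolled decoding loop that yields each frame as it is generated."""
    state = {"t": 0, "max_steps": max_steps}
    prev = [0.0] * 4                          # all-zero "go" frame
    while True:
        frame, stop, state = decoder_step(state, prev)
        yield frame                           # a vocoder could consume this right away
        if stop:
            break
        prev = frame


frames = list(stream_frames(3))
print(len(frames))  # 3 frames, produced one at a time
```

Because the loop yields per frame, latency is dominated by the cost of the first decoder step plus vocoder start-up, which is what makes the reported latency sentence-length-independent.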
Proceedings of INTERSPEECH 2020
Database: OpenAIRE