Whispered and Lombard Neural Speech Synthesis

Authors: Tuomo Raitio, Tobias Bleisch, Petko N. Petkov, Qiong Hu, Varun Lakshminarasimhan, Erik Marchi
Publication year: 2021
Subject:
FOS: Computer and information sciences
Computer Science - Machine Learning
Sound (cs.SD)
Computer science
Mean opinion score
Speech recognition
media_common.quotation_subject
Speech synthesis
02 engineering and technology
Intelligibility (communication)
computer.software_genre
Computer Science - Sound
Style (sociolinguistics)
Machine Learning (cs.LG)
030507 speech-language pathology & audiology
03 medical and health sciences
Audio and Speech Processing (eess.AS)
0202 electrical engineering, electronic engineering, information engineering
FOS: Electrical engineering, electronic engineering, information engineering
Active listening
Quality (business)
media_common
Signal processing
Computer Science - Computation and Language
020206 networking & telecommunications
0305 other medical science
Encoder
computer
Computation and Language (cs.CL)
Electrical Engineering and Systems Science - Audio and Speech Processing
Source: SLT
DOI: 10.48550/arxiv.2101.05313
Description: It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal-processing-based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high-quality speech through the pre-training/fine-tuning approach for all speaking styles, and 2) although our speaker verification (SV) model is not explicitly trained to discriminate between speaking styles, and no Lombard or whisper voice is used for pre-training this system, the SV model can be used as a style encoder that generates different style embeddings as input for the Tacotron system. We also show that the resulting synthetic Lombard speech yields a significant gain in intelligibility.
Comment: To appear in SLT 2021
Database: OpenAIRE