Deep Speech Synthesis from Multimodal Articulatory Representations

Author: Wu, Peter, Yu, Bohan, Scheck, Kevin, Black, Alan W, Krishnapriyan, Aditi S., Chen, Irene Y., Schultz, Tanja, Watanabe, Shinji, Anumanchipalli, Gopala K.
Publication year: 2024
Subject:
Document type: Working Paper
Description: The amount of articulatory data available for training deep learning models is far smaller than the amount of acoustic speech data. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks with real-time magnetic resonance imaging (MRI) and surface electromyography (EMG) inputs, the intelligibility of the synthesized outputs improves noticeably. For example, compared to prior work, our proposed transfer learning methods improve MRI-to-speech performance by 36% in word error rate. Beyond these intelligibility gains, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
Database: arXiv
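
As a rough illustration of the transfer-learning recipe the description sketches, the following is a minimal sketch, assuming a PyTorch setup in which an encoder-decoder is first pre-trained on plentiful acoustic data and the pre-trained decoder is then reused while fine-tuning a small articulatory (MRI/EMG) encoder on low-resource data. All module names, dimensions, and data loaders here are hypothetical and do not come from the paper.

    # Hypothetical sketch: multimodal pre-training, then low-resource fine-tuning.
    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        """Maps an input feature sequence to a shared hidden representation."""
        def __init__(self, in_dim, hid_dim=256):
            super().__init__()
            self.rnn = nn.GRU(in_dim, hid_dim, batch_first=True)

        def forward(self, x):
            out, _ = self.rnn(x)
            return out

    class Decoder(nn.Module):
        """Maps shared hidden states to acoustic targets (e.g., mel frames)."""
        def __init__(self, hid_dim=256, out_dim=80):
            super().__init__()
            self.proj = nn.Linear(hid_dim, out_dim)

        def forward(self, h):
            return self.proj(h)

    # Synthetic stand-ins for real datasets (batch, time, feature_dim).
    acoustic_loader = [(torch.randn(4, 100, 80), torch.randn(4, 100, 80))
                       for _ in range(2)]
    articulatory_loader = [(torch.randn(4, 100, 12), torch.randn(4, 100, 80))
                           for _ in range(2)]

    # Stage 1: pre-train encoder + decoder on abundant acoustic data.
    acoustic_enc = Encoder(in_dim=80)   # assumes 80-dim mel inputs
    decoder = Decoder(out_dim=80)
    opt = torch.optim.Adam(
        list(acoustic_enc.parameters()) + list(decoder.parameters()), lr=1e-3)
    loss_fn = nn.L1Loss()

    for mel_in, mel_tgt in acoustic_loader:
        opt.zero_grad()
        loss = loss_fn(decoder(acoustic_enc(mel_in)), mel_tgt)
        loss.backward()
        opt.step()

    # Stage 2: swap in an articulatory encoder, keep the pre-trained decoder,
    # and fine-tune on the small articulatory-to-speech dataset.
    artic_enc = Encoder(in_dim=12)      # assumes 12-dim articulatory features
    ft_opt = torch.optim.Adam(
        list(artic_enc.parameters()) + list(decoder.parameters()), lr=1e-4)

    for artic_in, mel_tgt in articulatory_loader:
        ft_opt.zero_grad()
        loss = loss_fn(decoder(artic_enc(artic_in)), mel_tgt)
        loss.backward()
        ft_opt.step()

The two-stage structure is the point of the sketch: the decoder's acoustic knowledge is transferred, so only the lightweight articulatory encoder must be learned mostly from scratch on scarce data.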