Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Author:	Hirschkind, Nameer; Yu, Xiao; Nandwana, Mahesh Kumar; Liu, Joseph; DuBois, Eloi; Le, Dao; Thiebaut, Nicolas; Sinclair, Colin; Spence, Kyle; Shang, Charles; Abrams, Zoe; McGuire, Morgan
Year of publication:	2024
Subject:	Computer Science - Machine Learning; Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
Document type:	Working Paper
Description:	We introduce DiffuseST, a low-latency, direct speech-to-speech translation system capable of preserving the input speaker's voice zero-shot while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. We find that the diffusion-based synthesizer improves MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5% while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency, allowing the entire model to run more than 5× faster than real-time. Comment: Published in Interspeech 2024
Database:	arXiv
External link:	http://arxiv.org/abs/2406.10223