MuSE: Multi-modal target speaker extraction with visual cues
| Author: | Chenglin Xu, Ruijie Tao, Zexu Pan, Haizhou Li |
|---|---|
| Language: | English |
| Year: | 2020 |
| Subjects: | Sound (cs.SD); Multimedia (cs.MM); Audio and Speech Processing (eess.AS); Image and Video Processing (eess.IV); FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering; Signal processing; Speech recognition; Synchronization; Visualization; Inference; Sensory cue; PESQ; Computer science; Modal; Focus (optics) |
| Source: | ICASSP |
| Description: | A speaker extraction algorithm relies on a speech sample from the target speaker as a reference to focus its attention. Such reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the reference target speech directly from the mixture at inference time, without the need for pre-recorded reference speech. We propose a multi-modal speaker extraction network, named MuSE, that is conditioned only on a lip image sequence. MuSE not only outperforms competitive baselines in terms of SI-SDR and PESQ (see the SI-SDR sketch after this record), but also shows consistent improvement in cross-dataset evaluations. Accepted at ICASSP 2021. |
| Database: | OpenAIRE |
| External link: | |
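
The description reports results in SI-SDR and PESQ. PESQ is an ITU-T standardized perceptual score that is normally obtained from a reference implementation, but SI-SDR has a simple closed form (as defined by Le Roux et al., 2019). Below is a minimal NumPy sketch of that standard definition, not code from the MuSE paper; the `si_sdr` function name and the toy signals are illustrative assumptions.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR), in dB.

    Zero-meaning removes any DC offset, and projecting the estimate
    onto the reference makes the score invariant to overall gain
    (the "scale-invariant" part). Higher is better.
    """
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling factor mapping the reference onto the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # part of the estimate explained by the reference
    noise = estimate - target    # everything left over counts as distortion
    return float(10.0 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2)))

# Toy check: a clean 440 Hz tone as reference, a lightly noised copy as
# the "extracted" signal. A reasonable extraction scores well above 0 dB.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)  # 1 s at 16 kHz
clean = np.sin(2.0 * np.pi * 440.0 * t)
extracted = clean + 0.1 * rng.standard_normal(clean.shape)
print(f"SI-SDR: {si_sdr(extracted, clean):.2f} dB")
```

Applying the same function to the unprocessed mixture gives the input SI-SDR, so the gain a model such as MuSE reports is typically the difference between the output and input scores.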