Multi-Level Speaker Representation for Target Speaker Extraction

Authors:	Zhang, Ke; Li, Junjie; Wang, Shuai; Wei, Yangjie; Wang, Yi; Wang, Yannan; Li, Haizhou
Year of publication:	2024
Subject:	Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound
Document type:	Working Paper
Description:	Target speaker extraction (TSE) relies on a reference cue of the target speaker to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding, pre-trained on a large number of speakers, may suffer from speaker identity confusion. In this work, we propose a multi-level speaker representation approach, ranging from raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on cross-attention mechanisms that integrate frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2Mix test set over the baseline. Comment: 5 pages. Submitted to ICASSP 2025. Implementation will be released at https://github.com/wenet-e2e/wesep
Database:	arXiv
External link:	http://arxiv.org/abs/2410.16059