Binaural Selective Attention Model for Target Speaker Extraction

Authors:	Meng, Hanyu; Zhang, Qiquan; Zhang, Xiangyu; Sethu, Vidhyasaharan; Ambikairajah, Eliathamby
Year of publication:	2024
Subject:	Electrical Engineering and Systems Science - Audio and Speech Processing; Computer Science - Sound; Electrical Engineering and Systems Science - Signal Processing
Document type:	Working Paper
Description:	The remarkable ability of humans to selectively focus on a target speaker in cocktail party scenarios is facilitated by binaural audio processing. In this paper, we present a binaural time-domain Target Speaker Extraction model based on the Filter-and-Sum Network (FaSNet). Inspired by human selective hearing, our proposed model introduces the target speaker embedding into the separators using a multi-head attention-based selective attention block. We also compare two binaural interaction approaches: the cosine similarity of time-domain signals and the inter-channel correlation in learned spectral representations. Our experimental results show that the proposed model outperforms monaural configurations and state-of-the-art multi-channel target speaker extraction models, achieving best-in-class performance with 18.52 dB SI-SDR, 19.12 dB SDR, and a PESQ score of 3.05 under anechoic two-speaker test configurations. Comment: Accepted by INTERSPEECH 2024
Database:	arXiv
External link:	http://arxiv.org/abs/2406.12236
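
The description reports results in SI-SDR (scale-invariant signal-to-distortion ratio). For readers unfamiliar with the metric, the following is a minimal sketch of its common definition, written in plain Python. It is not taken from the paper: the function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions of this illustration, and the authors' evaluation code may differ in detail.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length sample sequences.

    Illustrative only; `eps` is an assumed guard against division by zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so a DC offset does not influence the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to obtain the scaled target.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Everything in the estimate that is not the scaled target is distortion.
    distortion = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)

    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


# Toy signals: a sine "speaker" plus an orthogonal cosine at one tenth
# of its amplitude, i.e. an energy ratio of 100 to 1.
reference = [math.sin(2 * math.pi * k / 100) for k in range(1000)]
estimate = [r + 0.1 * math.cos(2 * math.pi * k / 100)
            for k, r in enumerate(reference)]
score_db = si_sdr(estimate, reference)
```

With these toy signals the energy ratio is 100 to 1, so `score_db` comes out at about 20 dB, and multiplying `estimate` by any nonzero constant leaves the score unchanged, which is the property that gives the metric its name.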