Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation
Author: Masuyama, Yoshiki; Chang, Xuankai; Zhang, Wangyou; Cornell, Samuele; Wang, Zhong-Qiu; Ono, Nobutaka; Qian, Yanmin; Watanabe, Shinji
Publication Year: 2023
Document Type: Working Paper
Description: Neural speech separation has made remarkable progress, and its integration with automatic speech recognition (ASR) is an important direction towards realizing multi-speaker ASR. This work provides an in-depth investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end. Specifically, we explore two multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as which features best serve the ASR back-end model. We employ a recent self-supervised learning representation (SSLR) as the feature and improve recognition performance over the case with filterbank features. To further improve multi-speaker recognition performance, we present a carefully designed training strategy for integrating speech separation and recognition with SSLR. The proposed integration, using TF-GridNet-based complex spectral mapping and WavLM-based SSLR, achieves a 2.5% word error rate on the reverberant WHAMR! test set, significantly outperforming an existing integration of mask-based MVDR beamforming and filterbank features (28.9%).
Comment: Accepted to IEEE WASPAA 2023
Database: arXiv
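Below is a minimal sketch of the SSLR feature-extraction step described in the abstract: audio from a separation front-end is passed through a pretrained WavLM encoder, and a weighted sum of its layer outputs serves as the feature for the ASR back-end. The checkpoint name, the random waveform standing in for separated speech, and the uniform layer weights are illustrative assumptions, not the paper's code; the paper's actual system additionally fine-tunes the TF-GridNet separator jointly with recognition, which is not shown here.

```python
# Minimal sketch: WavLM-based SSLR features for an ASR back-end.
# Assumptions (not from the paper): the HuggingFace checkpoint
# "microsoft/wavlm-base-plus" and a random waveform standing in for
# the output of a speech separation front-end.
import torch
from transformers import WavLMModel

model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
model.eval()

# One second of 16 kHz audio standing in for one separated speaker.
separated_wav = torch.randn(1, 16000)

with torch.no_grad():
    out = model(separated_wav, output_hidden_states=True)

# Stack all encoder outputs: (num_layers + 1, batch, frames, dim).
layers = torch.stack(out.hidden_states)

# SSLR features are commonly a learnable weighted sum over layers
# (SUPERB-style); uniform weights here stand in for trained ones.
layer_weights = torch.softmax(torch.zeros(layers.shape[0]), dim=0)
sslr_features = (layer_weights[:, None, None, None] * layers).sum(dim=0)

print(sslr_features.shape)  # (1, ~49, 768): frames x feature dim
```

In this style of pipeline, the layer weights (and optionally the WavLM encoder itself) are trained together with the ASR model, which is what allows SSLR features to outperform fixed filterbank features.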