Multi-Channel Speech Enhancement and Amplitude Modulation Analysis for Noise Robust Automatic Speech Recognition
Authors: | Stefan Goetze, Birger Kollmeier, Jörn Anemüller, Kamil Adiloglu, Niko Moritz |
Contributors: | Publica |
Language: | English |
Year of publication: | 2017 |
Subjects: | Voice activity detection, microphone, computer science, speech recognition, word error rate, networking & telecommunications, engineering and technology, filter bank, speech processing, theoretical computer science, human-computer interaction, speech enhancement, speech-language pathology & audiology, medical and health sciences, electrical engineering, electronic engineering, information engineering, recurrent neural network, language model, software |
Description: | The paper describes a system for automatic speech recognition (ASR) that is benchmarked on data of the 3rd CHiME challenge, a dataset comprising distant-microphone recordings of noisy acoustic scenes in public environments. The proposed ASR system employs various methods to increase recognition accuracy and noise robustness. Two different multi-channel speech enhancement techniques are used to suppress interfering sounds in the audio stream. The first aims at separating the target speaker's voice from background sources using non-negative matrix factorization (NMF), with variational Bayesian (VB) inference to estimate the NMF parameters. The second is a time-varying minimum variance distortionless response (MVDR) beamformer that exploits spatial information to suppress sound signals not arriving from the desired direction. Prior to speech enhancement, a microphone channel failure detector is applied that cross-compares channels using a modulation-spectral representation of the speech signal. ASR feature extraction employs an amplitude modulation filter bank (AMFB), which incorporates prior knowledge of speech to analyze its temporal dynamics. In conjunction with a deep neural network (DNN) based ASR system, AMFB features outperform the commonly used frame splicing of filter-bank features, which can be regarded as an equivalent data-driven approach to extracting modulation-spectral information. In addition, features are speaker-adapted, a recurrent neural network (RNN) is employed for language modeling, and hypotheses of different ASR systems are combined to further improve recognition accuracy. The proposed ASR system achieves an absolute word error rate (WER) of 5.67% on the real evaluation test data, which is 0.16% lower than the best score reported within the 3rd CHiME challenge. |
Database: | OpenAIRE |
External link: |
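The MVDR beamformer mentioned in the abstract can be illustrated with a minimal narrowband sketch. This is not the authors' time-varying implementation; the array geometry, frequency, and look direction below are hypothetical assumptions chosen only to show the weight formula w = R⁻¹d / (dᴴR⁻¹d) and its distortionless constraint wᴴd = 1.

```python
import numpy as np

# Minimal narrowband MVDR sketch (assumed parameters, not the paper's
# time-varying system): a uniform linear array with M microphones,
# a steering vector d for the desired direction, and a noise
# covariance R estimated from simulated noise-only snapshots.

rng = np.random.default_rng(0)
M = 6                      # number of microphones (assumed)
f = 1000.0                 # analysis frequency in Hz (assumed)
c, spacing = 343.0, 0.05   # speed of sound, mic spacing in metres (assumed)
theta = np.deg2rad(30.0)   # desired direction of arrival (assumed)

# Steering vector: relative phase delays across the array.
delays = np.arange(M) * spacing * np.sin(theta) / c
d = np.exp(-2j * np.pi * f * delays)

# Hermitian noise covariance from noise-only snapshots; diagonal
# loading keeps the inverse well conditioned.
noise = rng.standard_normal((M, 500)) + 1j * rng.standard_normal((M, 500))
R = noise @ noise.conj().T / 500 + 1e-3 * np.eye(M)

# MVDR weights: w = R^{-1} d / (d^H R^{-1} d) minimise output noise
# power subject to the distortionless constraint w^H d = 1.
Rinv_d = np.linalg.solve(R, d)
w = Rinv_d / (d.conj() @ Rinv_d)

# Signals from the look direction pass through with unit gain,
# while off-axis noise is attenuated.
print(abs(w.conj() @ d))   # ≈ 1.0
```

Because R is Hermitian positive definite, the denominator dᴴR⁻¹d is real and positive, so the unit-gain constraint holds exactly; in the paper's time-varying variant the covariance and weights are re-estimated over time as the acoustic scene changes.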