Audio-Visual Speech Recognition System Using Recurrent Neural Network

Authors: Kai-Xian Lau, Yoon-Ket Lee, Yeh-Huann Goh
Year of publication: 2019
Subject:
Source: 2019 4th International Conference on Information Technology (InCIT).
Description: An audio-visual speech recognition (AVSR) system integrates audio and visual information to perform the speech recognition task. AVSR has many practical applications, especially in natural language processing systems such as speech-to-text conversion, automatic translation, and sentiment analysis. For decades, researchers tended to build speech recognition systems on the Hidden Markov Model (HMM) because of its strong recognition rates. However, an HMM requires an enormous training dataset to achieve sufficient linguistic coverage, and its recognition rate in noisy environments is unsatisfactory. To overcome these deficiencies, a Recurrent Neural Network (RNN) based AVSR system is proposed. The proposed AVSR model consists of three components: 1) an audio feature extraction mechanism, 2) a visual feature extraction mechanism, and 3) an audio-visual feature integration mechanism. The feature integration mechanism combines the output features from the audio and visual extraction mechanisms to generate the final classification results. In this research, audio features are extracted as Mel-Frequency Cepstral Coefficients (MFCC) and further processed by an RNN, whereas visual features are extracted by Haar-cascade detection with OpenCV and likewise processed by an RNN. The two feature streams are then integrated by a multimodal RNN-based feature integration mechanism. The recognition rate and robustness of the proposed AVSR system were evaluated on clean speech and on speech at Signal-to-Noise Ratio (SNR) levels ranging from −20 dB to 20 dB in 5 dB intervals. On average, the final speech recognition rate is 89% across the different SNR levels.
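
The abstract names the three components but gives no implementation details, so the pipeline it describes can only be sketched. The sketch below assumes Python with librosa (MFCC extraction), OpenCV (Haar-cascade detection), and PyTorch (RNN encoders and fusion); the GRU cell choice, layer sizes, the stock frontal-face cascade, and the number of output classes are illustrative assumptions, not values from the paper.

import cv2
import librosa
import numpy as np
import torch
import torch.nn as nn

def extract_audio_features(wav_path, n_mfcc=13):
    # Audio branch: MFCC features, one n_mfcc-dim vector per frame.
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return torch.from_numpy(mfcc.T).float()          # (frames, n_mfcc)

def extract_visual_features(frames, roi_size=(32, 32)):
    # Visual branch: Haar-cascade detection with OpenCV; each detected
    # region is resized and flattened into a per-frame feature vector.
    # (Uses OpenCV's bundled frontal-face cascade; the paper may target
    # a different region, e.g. the mouth.)
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    feats = []
    for frame in frames:                             # list of BGR images
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(boxes) == 0:
            continue                                 # skip frames with no detection
        x, y, w, h = boxes[0]
        roi = cv2.resize(gray[y:y + h, x:x + w], roi_size)
        feats.append(roi.astype(np.float32).ravel() / 255.0)
    return torch.from_numpy(np.stack(feats))         # (frames, 32*32)

class AVSR(nn.Module):
    # Multimodal RNN: one recurrent encoder per modality; the final hidden
    # states are concatenated (feature integration) and classified jointly.
    def __init__(self, n_mfcc=13, vis_dim=32 * 32, hidden=128, n_classes=10):
        super().__init__()
        self.audio_rnn = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.visual_rnn = nn.GRU(vis_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, visual):                # (batch, T, feat) each
        _, h_a = self.audio_rnn(audio)
        _, h_v = self.visual_rnn(visual)
        fused = torch.cat([h_a[-1], h_v[-1]], dim=1)
        return self.classifier(fused)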
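
For the noisy-speech evaluation, each test utterance would be mixed with noise scaled to a target SNR. A minimal sketch, assuming additive white Gaussian noise (the abstract does not specify the noise type):

import numpy as np

def add_noise_at_snr(signal, snr_db):
    # Scale white Gaussian noise so that 10*log10(P_signal / P_noise)
    # equals the target SNR, then mix it into the clean signal.
    noise = np.random.randn(len(signal)).astype(np.float32)
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return signal + scale * noise

# The paper's test conditions: -20 dB to 20 dB in 5 dB steps.
# noisy = {snr: add_noise_at_snr(clean, snr) for snr in range(-20, 25, 5)}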
Database: OpenAIRE