Authors: |
Handa, Anand; Agarwal, Rashi; Kohli, Narendra |
Subject: |
|
Source: |
Multimedia Tools & Applications; Jul2020, Vol. 79 Issue 27/28, p20461-20481, 21p |
Abstract: |
Spoken keyword recognition and localization, known as keyword spotting, is one of the fundamental aspects of speech recognition. In automatic keyword spotting systems, lip-reading (LR) methods play a broader role when audio data is absent or corrupted. Existing work has focused on recognizing a limited number of words or phrases and requires a cropped region of the face or lips. The proposed model, by contrast, does not require cropping of the video frames and is recognition-free. It uses Convolutional Neural Networks and Long Short-Term Memory networks to improve overall performance. The model creates a 128-dimensional subspace to represent the feature vectors for speech signals and the corresponding lip movements (focused viseme sequences). The proposed model can thus treat lip reading as an unconstrained natural speech signal in video sequences. The experiments evaluate the proposed model on several standard datasets: LRW (Oxford-BBC), MIRACL-VC1, OuluVS, GRID, and CUAVE. They also include a comparative analysis of the proposed model against current state-of-the-art methods for the lip-reading and keyword spotting tasks. The proposed model obtains excellent results on all datasets under consideration. [ABSTRACT FROM AUTHOR] |
Database: |
Complementary Index |
External link: |
|