Speaker-Independent Lipreading With Limited Data
Authors: | Xingxuan Zhang, Chen-Zhao Yang, Yun Zhu, Shilin Wang |
---|---|
Year: | 2020 |
Subject: |
Artificial neural network, Computer science, Speech recognition, Feature extraction, Normalization (statistics), Normalization (image processing), Hidden Markov model, Transformer (machine learning model), Block (data storage), Scheme (programming language) |
Source: | ICIP |
DOI: | 10.1109/icip40778.2020.9190780 |
Description: | Recent research has demonstrated that, given a large annotated training dataset, sophisticated automatic lipreading methods can outperform even a professional human lip reader. However, when the training set is limited, i.e. contains only a small number of speakers, most existing lipreading approaches cannot provide accurate recognition results for unseen speakers due to inter-speaker variability. To improve lipreading performance in the speaker-independent scenario, a new deep neural network (DNN) is proposed in this paper. The proposed network is composed of two parts: the Transformer-based Visual Speech Recognition Network (TVSR-Net) and the Speaker Confusion Block (SC-Block). The TVSR-Net extracts lip features and recognizes the speech, while the SC-Block achieves speaker normalization by eliminating the influence of individual talking styles and habits. A Multi-Task Learning (MTL) scheme is designed for network optimization. Experimental results on the GRID dataset demonstrate the effectiveness of the proposed network for speaker-independent recognition with limited training data. |
Database: | OpenAIRE |
External link: |
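The Multi-Task Learning scheme described in the abstract can be sketched as a weighted combination of a speech-recognition loss and a speaker-confusion loss. The sketch below is a minimal illustration, not the paper's implementation: it assumes (hypothetically) that the SC-Block's confusion objective drives the speaker classifier's posterior toward the uniform distribution, so that the shared lip features carry no speaker identity; the loss form, the `lam` weight, and both function names are assumptions.

```python
import math

def speaker_confusion_loss(speaker_probs):
    """Cross-entropy between the predicted speaker posterior and the
    uniform distribution over speakers. It is minimized (equal to
    log(num_speakers)) exactly when the posterior is uniform, i.e. when
    the features reveal nothing about who is talking.
    (Hypothetical formulation -- the paper's exact SC loss may differ.)"""
    n = len(speaker_probs)
    return -sum((1.0 / n) * math.log(p) for p in speaker_probs)

def multi_task_loss(recognition_loss, speaker_probs, lam=0.1):
    """Combine the speech-recognition loss with the speaker-confusion
    term, weighted by a trade-off coefficient lam (assumed value)."""
    return recognition_loss + lam * speaker_confusion_loss(speaker_probs)
```

Under this formulation, a posterior peaked on one speaker incurs a larger confusion penalty than a uniform posterior, so minimizing the combined objective pushes the shared features toward speaker normalization while still optimizing recognition accuracy.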