Spatio-Temporal Fusion Based Convolutional Sequence Learning for Lip Reading

Authors:	Feng Cheng, Xingxuan Zhang, Wang Shi-lin
Year of publication:	2019
Subject:	Training set; Machine translation; Computer science; Feature extraction; Pattern recognition; 02 engineering and technology; Viseme; 010501 environmental sciences; 01 natural sciences; 0202 electrical engineering, electronic engineering, information engineering; Feature (machine learning); 020201 artificial intelligence & image processing; Sequence learning; Artificial intelligence; Hidden Markov model; 0105 earth and related environmental sciences; Block (data storage)
Source:	ICCV
DOI:	10.1109/iccv.2019.00080
Description:	Current state-of-the-art approaches for lip reading are based on sequence-to-sequence architectures that are designed for natural machine translation and audio speech recognition. Hence, these methods do not fully exploit the characteristics of the lip dynamics, which causes two main drawbacks. First, the short-range temporal dependencies, which are critical to the mapping from lip images to visemes, receive no extra attention. Second, local spatial information is discarded in the existing sequence models due to the use of global average pooling (GAP). To address these drawbacks, we propose a Temporal Focal block to sufficiently describe short-range dependencies and a Spatio-Temporal Fusion Module (STFM) to maintain the local spatial information and to reduce the feature dimensions as well. The experimental results demonstrate that our method achieves performance comparable with the state-of-the-art approach while using much less training data and a much lighter Convolutional Feature Extractor. The training time is reduced by 12 days due to the convolutional structure and the local self-attention mechanism.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::5723aa864f42ab7f39a0698913677fde https://doi.org/10.1109/iccv.2019.00080