Description: |
Visual Speech Recognition (VSR), commonly referred to as automated lip-reading, is an emerging technology that interprets speech by visually analyzing lip movements. A key challenge in VSR, in which visually distinct words produce similar lip movements, is known as the homopheme problem. Visemes are the basic visual units of speech, produced by the movements and positions of the lips. Because visemes typically have shorter durations than words, there is less temporal information available for distinguishing between viseme classes, which increases visual ambiguity during classification. To address this challenge, viseme classification must not only extract spatial features from lip images but also handle visemes of varying durations and their temporal features. Therefore, this study proposes a new deep learning approach, SlowFast-TCN. A SlowFast network is used as the frontend architecture to extract spatio-temporal features through its slow and fast pathways. A Temporal Convolutional Network (TCN) is used as the backend architecture to learn from the frontend features and perform the classification. A comparative ablation analysis is performed to dissect the proposed SlowFast-TCN and evaluate the impact of each component. This study utilizes a benchmark dataset, Lip Reading in the Wild (LRW), which focuses on the English language. Two subsets of the LRW dataset, comprising homopheme words and unique words, represent the homophemic and non-homophemic datasets, respectively. The proposed approach is evaluated under varying lighting conditions to assess its performance in real-world scenarios; it was found that illumination can significantly affect the visual data. Key performance metrics, such as accuracy and loss, are used to evaluate the effectiveness of the proposed approach. The proposed approach outperforms traditional baseline models in accuracy while maintaining competitive execution time.
Its dual-pathway architecture effectively captures both long-term dependencies and short-term motions, leading to better performance on both homophemic and non-homophemic datasets. However, it is less robust under non-ideal lighting, indicating the need for further enhancements to handle diverse illumination conditions. Doi: 10.28991/ESJ-2024-08-06-024 Full Text: PDF |
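The two ideas the abstract combines can be sketched briefly: a SlowFast frontend samples the same frame sequence at two rates (a sparse slow pathway for long-term context, a dense fast pathway for short-term motion), and a TCN backend applies dilated causal convolutions so each output depends only on past timesteps. The following is a minimal pure-Python illustration under these assumptions; the function names `sample_pathways` and `causal_conv1d` are hypothetical and not taken from the paper's implementation.

```python
def sample_pathways(frames, alpha=4):
    """Split a frame sequence into a slow pathway (every alpha-th frame,
    long-term context) and a fast pathway (all frames, short-term motion).
    alpha is the slow/fast sampling-rate ratio (illustrative default)."""
    slow = frames[::alpha]
    fast = frames[:]
    return slow, fast

def causal_conv1d(seq, kernel, dilation=1):
    """Dilated causal 1D convolution over a scalar sequence: the output at
    time t uses only inputs at t, t-d, t-2d, ... (no future leakage),
    which is the core building block of a TCN."""
    k = len(kernel)
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t - (k - 1 - i) * dilation  # tap strictly into the past
            if j >= 0:
                acc += w * seq[j]
        out.append(acc)
    return out

frames = list(range(16))                    # stand-in for 16 lip-crop frames
slow, fast = sample_pathways(frames, alpha=4)
y = causal_conv1d(fast, kernel=[0.25, 0.25, 0.5], dilation=2)
```

Stacking such layers with growing dilation (1, 2, 4, ...) gives the TCN a receptive field that covers visemes of varying durations without recurrent connections, which is why it suits variable-length temporal features from the frontend.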