Author:
Leelavathy, N., Lohith, S., Jeswanthi, A., Ramya, L., Anjali, N.
Subject:

Source:
AIP Conference Proceedings; 2023, Vol. 2492 Issue 1, p1-5, 5p
Abstract:
Reading lips (i.e., extracting phonemes from lip movements alone) is difficult, especially in noisy videos. To address this problem, we developed a deep-learning approach to lip reading. Given the past success of CNNs as image classifiers, we apply a convolutional neural network directly to the video frames. The resulting model identifies the correct spoken phoneme 48% of the time from images alone, which is close to peak human performance. Here we present several methods for predicting words and phrases from video only, without any audio signal. We employ a VGGNet pre-trained on human faces of celebrities from IMDB and Google Images, and explore different ways of using it to handle these image sequences. The VGGNet is trained on images concatenated from multiple frames in each sequence, and is also used in conjunction with LSTMs to extract temporal information. While the LSTM models failed to outperform the other methods for a variety of reasons, the concatenated-image model using nearest-neighbour interpolation performed well in terms of validation accuracy. [ABSTRACT FROM AUTHOR]
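To make the concatenated-image approach concrete, the following PyTorch sketch illustrates one plausible reading of it; it is not the authors' implementation. The class name ConcatFrameVGG, the phoneme count NUM_PHONEMES, and the choice of four frames per sample are all hypothetical, and torchvision's ImageNet-pretrained VGG16 stands in for the face-pretrained VGGNet described in the abstract. Frames are resized with nearest-neighbour interpolation and tiled into a single 224x224 image, and the VGG classifier head is replaced for phoneme classes.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

NUM_PHONEMES = 40  # hypothetical class count; not stated in the abstract
FRAMES = 4         # hypothetical number of frames tiled per sample

class ConcatFrameVGG(nn.Module):
    """Tile several video frames into one image and classify with a VGGNet.

    Each frame is resized with nearest-neighbour interpolation (as the
    abstract describes) and placed in a 2x2 grid, producing a single
    224x224 image that matches VGG's expected input size.
    """

    def __init__(self, num_classes: int = NUM_PHONEMES):
        super().__init__()
        # Stand-in backbone: torchvision's ImageNet VGG16; the paper
        # uses a VGGNet pre-trained on celebrity faces instead.
        self.vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        in_features = self.vgg.classifier[-1].in_features
        self.vgg.classifier[-1] = nn.Linear(in_features, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, FRAMES, 3, H, W)
        b, t, c, h, w = frames.shape
        # Nearest-neighbour resize every frame to a 112x112 tile.
        tiles = F.interpolate(frames.reshape(b * t, c, h, w),
                              size=(112, 112), mode="nearest")
        tiles = tiles.reshape(b, t, c, 112, 112)
        # Lay the four tiles out as a 2x2 grid -> (b, 3, 224, 224).
        top = torch.cat([tiles[:, 0], tiles[:, 1]], dim=3)
        bottom = torch.cat([tiles[:, 2], tiles[:, 3]], dim=3)
        grid = torch.cat([top, bottom], dim=2)
        return self.vgg(grid)  # phoneme logits

# Example: a batch of 2 samples, 4 frames each.
logits = ConcatFrameVGG()(torch.rand(2, FRAMES, 3, 120, 120))
print(logits.shape)  # torch.Size([2, 40])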
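The abstract's second variant pairs the VGGNet with LSTMs to extract temporal information. The sketch below, under the same assumptions (a hypothetical VGGLSTMReader name and class count, ImageNet weights as a stand-in for the face-pretrained weights), extracts per-frame VGG features and classifies from the final LSTM hidden state; classifying from the last hidden state is one common design choice, not necessarily the authors'.

import torch
import torch.nn as nn
from torchvision import models

NUM_PHONEMES = 40  # hypothetical class count; not stated in the abstract

class VGGLSTMReader(nn.Module):
    """Per-frame VGG features fed to an LSTM for temporal modelling."""

    def __init__(self, num_classes: int = NUM_PHONEMES, hidden: int = 256):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.features = vgg.features  # convolutional trunk
        self.pool = vgg.avgpool       # adaptive pool -> (512, 7, 7)
        self.lstm = nn.LSTM(input_size=512 * 7 * 7,
                            hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, 224, 224)
        b, t, c, h, w = frames.shape
        x = self.features(frames.reshape(b * t, c, h, w))
        x = self.pool(x).flatten(1).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(x)  # h_n: (layers, batch, hidden)
        return self.head(h_n[-1])   # classify from the last hidden state

# Example: a batch of 2 clips, 6 frames each.
logits = VGGLSTMReader()(torch.rand(2, 6, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 40])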
Database:
Complementary Index
External link: