Enhanced spatio-temporal 3D CNN for facial expression classification in videos.

Author: Khanna, Deepanshu; Jindal, Neeru; Rana, Prashant Singh; Singh, Harpreet
Source: Multimedia Tools & Applications; Jan 2024, Vol. 83, Issue 4, p9911-9928, 18p
Abstract: This article proposes a hybrid network model for a video-based human facial expression recognition (FER) system, consisting of an end-to-end 3D deep convolutional neural network. The proposed network combines two commonly used deep 3-dimensional convolutional neural network (3D CNN) models, ResNet-50 and DenseNet-121, in an end-to-end manner with slight modifications. Various methodologies currently exist for FER, such as 2-dimensional convolutional neural networks (2D CNN), 2D CNN-recurrent neural networks, 3D CNNs, and feature-extraction algorithms such as PCA and the histogram of oriented gradients (HOG) combined with machine learning classifiers. For the proposed model, we choose a 3D CNN over the other methods because, unlike a 2D CNN, it preserves the temporal information of the videos; moreover, it is not labor-intensive in the way that handcrafted feature-extraction methods are. The proposed system relies on the temporal averaging of information from the frame sequences of a video. The databases are pre-processed to remove unwanted backgrounds before the 3D deep CNN is trained from scratch. Initially, feature vectors are extracted from video frame sequences using the 3D ResNet model. These feature vectors are fed to the 3D DenseNet model's blocks, which then classify the predicted emotion. The model is evaluated on three benchmark databases, Ravdess, CK+, and BAUM1s, achieving 91.69%, 98.61%, and 73.73% accuracy on the respective databases and outperforming various existing methods. We show that the proposed architecture works well even for classes with little training data, where many existing 3D CNN networks fail. [ABSTRACT FROM AUTHOR]
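The abstract describes the architecture only at a high level. Below is a minimal PyTorch sketch of the general idea: a 3D residual stage extracts spatio-temporal features from a video clip, a 3D dense block refines them, and pooled features are classified. All class names, channel sizes, and block depths here are illustrative assumptions; the paper's actual model uses full (modified) 3D ResNet-50 and DenseNet-121 architectures, so this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class Residual3DBlock(nn.Module):
    """Simplified 3D residual block (stand-in for a 3D ResNet-50 stage)."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut, as in ResNet


class Dense3DBlock(nn.Module):
    """Simplified 3D dense block: each layer's output is concatenated to its input."""

    def __init__(self, in_channels, growth_rate=16, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True),
                nn.Conv3d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)  # dense connectivity, as in DenseNet
        return x


class HybridResNetDenseNet3D(nn.Module):
    """Hypothetical end-to-end pipeline: 3D ResNet-style features feed
    3D DenseNet-style blocks, followed by pooling and a classifier."""

    def __init__(self, num_classes=7):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 7, 7), stride=(1, 2, 2),
                      padding=(1, 3, 3), bias=False),
            nn.BatchNorm3d(32),
            nn.ReLU(inplace=True),
        )
        self.res_stage = Residual3DBlock(32)
        self.dense_stage = Dense3DBlock(32, growth_rate=16, num_layers=3)
        self.pool = nn.AdaptiveAvgPool3d(1)  # averages over time and space
        self.fc = nn.Linear(self.dense_stage.out_channels, num_classes)

    def forward(self, clip):  # clip: (batch, 3, frames, height, width)
        x = self.stem(clip)
        x = self.res_stage(x)
        x = self.dense_stage(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)


# Example: a batch of two 16-frame 112x112 RGB clips, 7 emotion classes.
model = HybridResNetDenseNet3D(num_classes=7)
logits = model(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 7])
```

The adaptive average pooling over the temporal axis stands in for the temporal averaging of frame-sequence information mentioned in the abstract; in the paper, the DenseNet blocks consume the ResNet-50 feature maps in a single end-to-end trainable network.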
Database: Complementary Index