Abstract:
Detecting a person's speaking mode is an important cue in many applications, including detecting active/inactive participants in video conferencing, monitoring students' attention in classrooms or online, analyzing students' engagement in live video lectures, and identifying drivers' distractions. However, automatically detecting the speaking mode from video is challenging due to low image resolution, noise, illumination changes, and unfavorable viewing conditions. This paper proposes a deep learning-based ensemble technique (called V-ensemble) to identify speaking modes, i.e., talking and non-talking, from low-resolution and noisy images. This work also introduces an automatic algorithm for acquiring image frames from a video stream and develops three datasets for this research (LLLR, YawDD-M, and SBD-M). The proposed system integrates mouth region extraction and mouth state detection modules. A multi-task cascaded convolutional neural network (MTCNN) is used to extract the mouth region. Eight popular classification approaches, namely ResNet18, ResNet35, ResNet50, VGG16, VGG19, CNN, InceptionV3, and SVM, have been investigated to select the best models for mouth state prediction. Experimental results with a rigorous comparative analysis showed that the proposed ensemble classifier achieved the highest accuracy on all three datasets: LLLR (96.80%), YawDD-M (96.69%), and SBD-M (96.90%).
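The abstract describes a two-stage pipeline: MTCNN-based mouth region extraction followed by an ensemble of mouth-state classifiers. The sketch below is a minimal illustration of that structure, assuming the facenet_pytorch implementation of MTCNN and a generic callable interface for the classifiers; the crop margin, soft-voting rule, and model interface are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the described pipeline (assumed libraries and parameters,
# not the paper's exact implementation).
import numpy as np
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(keep_all=False)  # assumption: one face per frame is sufficient


def extract_mouth_region(frame: Image.Image, margin: int = 15) -> Image.Image:
    """Crop the mouth region using MTCNN's five facial landmarks
    (the last two landmarks are the left and right mouth corners)."""
    boxes, probs, landmarks = detector.detect(frame, landmarks=True)
    if landmarks is None:
        raise ValueError("No face detected in frame")
    left_corner, right_corner = landmarks[0][3], landmarks[0][4]
    x1 = int(min(left_corner[0], right_corner[0])) - margin
    x2 = int(max(left_corner[0], right_corner[0])) + margin
    y1 = int(min(left_corner[1], right_corner[1])) - margin
    y2 = int(max(left_corner[1], right_corner[1])) + margin
    return frame.crop((x1, y1, x2, y2))


def ensemble_mouth_state(mouth_img, models) -> str:
    """Soft-voting ensemble: each model is assumed to be a callable that maps
    a mouth crop to [p_non_talking, p_talking]; the averaged probabilities
    decide the final talking / non-talking label."""
    probs = np.mean([m(mouth_img) for m in models], axis=0)
    return "talking" if probs[1] > probs[0] else "non-talking"
```

A frame-level prediction would then be `ensemble_mouth_state(extract_mouth_region(frame), models)`, where `models` holds the trained mouth-state classifiers (e.g., the ResNet, VGG, and InceptionV3 variants mentioned above) wrapped behind the assumed callable interface.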