Human Uncivilized Behavior Detection Method Integrating Non-uniform Sampling and Feature Enhancement

Author: YE Hao, WANG Longye, ZENG Xiaoli, XIAO Yue
Language: Chinese
Publication year: 2024
Subject:
Source: Jisuanji kexue yu tansuo, Vol 18, Iss 12, Pp 3219-3234 (2024)
Document type: article
ISSN: 1673-9418
DOI: 10.3778/j.issn.1673-9418.2401064
Description: To address the false detection of similar behaviors and the low accuracy of detecting local body behaviors in spatio-temporal detection of abnormal human actions, a method integrating non-uniform sampling and feature enhancement is proposed, built on a self-constructed uncivilized behavior spatio-temporal action detection dataset (UBSAD). Firstly, the method adopts the video swin transformer (VST) as the backbone network in the spatio-temporal feature extraction stage to capture long-term temporal dependencies in videos and strengthen the network's ability to learn global information. Additionally, a ringed residual VST block replaces the standard VST block in the final stage of the backbone, enlarging the difference between the target region and the background; combined with the multi-head self-attention mechanism, this strengthens feature extraction for the target region. Furthermore, in the video frame collection stage, a non-uniform sampling method is proposed that adjusts the input data distribution to the task, allowing the model to capture action-change information in a hierarchical manner and effectively improving the network's attention to the fine-grained features that distinguish similar behaviors. Finally, after the feature extraction network, a cascaded pooling three-dimensional spatial pyramid feature enhancement module that incorporates shallow features is embedded to further improve feature applicability across scales, reduce the loss of detailed motion information during feature extraction, suppress interference from background information, and thereby enhance the features. Experimental results show that the method achieves mAP of 71.93% on the UBSAD dataset and 83.09% on the public UCF101-24 dataset, which are 7.39 and 1.22 percentage points higher, respectively, than the baseline that uses VST alone as the spatio-temporal feature extraction model, demonstrating the method's effectiveness in accurately detecting behavior.
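This record does not spell out the exact non-uniform sampling schedule. As a rough illustration only, the Python sketch below shows one plausible scheme: mixing sparse frames covering the whole clip (coarse context) with a denser window around a key frame (fine action-change detail). The function name non_uniform_sample and parameters such as dense_ratio are illustrative assumptions, not the authors' implementation.

import numpy as np

def non_uniform_sample(num_frames, clip_len=32, key_idx=None, dense_ratio=0.5):
    """Hypothetical non-uniform sampler (illustrative, not the paper's exact scheme).

    Mixes sparse global coverage with dense sampling around a key frame, so the
    sampled indices carry both coarse temporal context and fine-grained detail.

    num_frames  -- total frames available in the video
    clip_len    -- number of frames fed to the backbone (e.g. a VST-style network)
    key_idx     -- frame index to densify around (defaults to the clip centre)
    dense_ratio -- fraction of the clip drawn from the dense local window
    """
    if key_idx is None:
        key_idx = num_frames // 2

    n_dense = int(clip_len * dense_ratio)
    n_sparse = clip_len - n_dense

    # Sparse, evenly spaced indices over the whole video: coarse context.
    sparse = np.linspace(0, num_frames - 1, n_sparse)

    # Dense indices inside a short window centred on the key frame: motion detail.
    half = max(n_dense // 2, 1)
    dense = np.linspace(max(key_idx - half, 0),
                        min(key_idx + half, num_frames - 1), n_dense)

    idx = np.concatenate([sparse, dense])
    idx = np.clip(np.round(idx), 0, num_frames - 1).astype(int)
    return np.sort(idx)  # exactly clip_len indices; duplicates near the key frame are allowed

# Usage example: a 300-frame video sampled down to a 32-frame clip.
print(non_uniform_sample(300, clip_len=32))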
Database: Directory of Open Access Journals