Abstract: |
Automatic human behavior monitoring is essential for surveillance cameras in public and private environments. Violence detection is challenging because the available violence datasets are insufficient for training deep networks, and because human behavior exhibits high intra-class variation and inter-class similarity. In this paper, we propose an unsupervised Spatial–Temporal Action Translation (STAT) network to accurately distinguish between behaviors and overcome the shortage of violence data. Our framework comprises a person detector, a motion feature extractor, the STAT network, and an output interpretation stage. The framework performs well in different environments because it detects objects in each frame and removes irrelevant background information. Because violent motion patterns change rapidly and at high velocity, temporal features play a crucial role in recognition, and we use them as the input to the STAT network. The STAT network is trained with normal behavior data and translates normal motion into the spatial frame. Because violent behavior involves complicated actions, the STAT network cannot reconstruct violent frames correctly; the output interpretation stage therefore categorizes actions by comparing the actual and reconstructed frames and measuring the reconstruction error. The proposed unsupervised framework achieved accuracy comparable to previous works and outperformed them in terms of generality. [ABSTRACT FROM AUTHOR] |
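
The output interpretation step described above, comparing an actual frame with its reconstruction and categorizing by the reconstruction error, can be illustrated with a minimal sketch. The abstract does not state which error measure or decision rule the authors use, so the mean squared error, the fixed `threshold`, and the function names `reconstruction_error` and `label_frame` are all assumptions made for illustration, not the paper's method.

```python
import numpy as np


def reconstruction_error(actual: np.ndarray, reconstructed: np.ndarray) -> float:
    """Mean squared pixel error between two frames of identical shape.

    Assumption: MSE stands in for whatever error measure the paper uses.
    """
    if actual.shape != reconstructed.shape:
        raise ValueError("frames must have the same shape")
    # Cast to float first so unsigned 8-bit images do not wrap around on subtraction.
    diff = actual.astype(np.float64) - reconstructed.astype(np.float64)
    return float(np.mean(diff ** 2))


def label_frame(actual: np.ndarray, reconstructed: np.ndarray, threshold: float) -> str:
    """Label a frame 'violent' when its reconstruction error exceeds the threshold.

    Assumption: a single fixed threshold is a hypothetical decision rule.
    """
    error = reconstruction_error(actual, reconstructed)
    return "violent" if error > threshold else "normal"
```

In this sketch, a network trained on normal behavior would yield a low error on normal frames and a high error on violent ones, so the threshold would in practice have to be chosen from the error distribution observed on normal data.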