Description: |
In this work, we propose two Deep Neural Networks, DNN-1 and DNN-2, based on a residual Fast-Slow Refined Highway (FSRH) and Global Atomic Spatial Attention (GASA) for effective action recognition and detection. The proposed DNN-1 comprises a 3D Convolutional Neural Network (3DCNN), a Residual FSRH (R_FSRH), a reduction layer, and a classification layer for action recognition. For action detection, which involves subject-region extraction and classification, the proposed DNN-2 consists of a 3DCNN, a region proposal network, an R_FSRH, GASA, and a classification-localization layer. The 3DCNN uses the layers from the input up to the “Mixed-3c” layer of the pre-trained Inflated 3D (I3D) network as its backbone. The FSRH is composed of two Refined Highway (RH) units that extract a pair of features from fast and slow actions, where each RH applies temporal attention via a non-local 3D convolution and an affine transform via temporal bilinear inception. In the R_FSRH, multiple cascaded FSRHs with different residual connections are investigated to determine the most effective configuration. GASA sequentially computes and concatenates correlation features between an atomic subject and the other subjects to effectively discover high-level semantic information. Extensive experiments, including ablation studies, were conducted to demonstrate the superior performance of the proposed DNN-1 and DNN-2 on five challenging video datasets: JHMDB-21, UCF101-24, Traffic Police (TP), Charades, and AVA. Notably, the proposed DNN-1 achieves state-of-the-art performance of 98.6% on UCF101-24 and 98.1% on TP, while DNN-2 achieves, to the best of our knowledge, a state-of-the-art video-mAP of 27.7% on AVA and the second-best video-mAP of 25.7% on Charades. Therefore, the proposed DNN-1 and DNN-2 can serve as outstanding context-aware engines for various video understanding applications. |
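To make the described composition of DNN-1 concrete, the following is a minimal PyTorch sketch of the pipeline named in the abstract (3DCNN backbone, residual cascade of FSRH blocks, reduction layer, classification layer). All layer widths, the simplified RH unit, the fast-slow fusion by addition, and the use of plain 3D convolutions in place of the non-local attention and temporal bilinear inception are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the DNN-1 composition; widths, fusion, and the
# simplified RH internals are assumptions for illustration only.
import torch
import torch.nn as nn


class RefinedHighway(nn.Module):
    """Simplified stand-in for one Refined Highway (RH) unit."""

    def __init__(self, channels: int):
        super().__init__()
        # Placeholder for the temporal attention (non-local 3D convolution in the paper).
        self.attention = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=1), nn.Sigmoid()
        )
        # Placeholder for the affine transform (temporal bilinear inception in the paper).
        self.transform = nn.Conv3d(
            channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.transform(x) * self.attention(x)


class FSRH(nn.Module):
    """Fast-Slow Refined Highway: two RH units for fast and slow motion features."""

    def __init__(self, channels: int):
        super().__init__()
        self.fast = RefinedHighway(channels)
        self.slow = RefinedHighway(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the fast-slow feature pair (additive fusion is an assumption).
        return self.fast(x) + self.slow(x)


class DNN1(nn.Module):
    """Backbone -> residual FSRH cascade -> reduction -> classification."""

    def __init__(self, channels: int = 480, num_blocks: int = 2, num_classes: int = 24):
        super().__init__()
        # Stand-in for the I3D layers up to "Mixed-3c" (480 output channels in I3D).
        self.backbone = nn.Conv3d(3, channels, kernel_size=3, stride=2, padding=1)
        self.r_fsrh = nn.ModuleList(FSRH(channels) for _ in range(num_blocks))
        self.reduce = nn.AdaptiveAvgPool3d(1)              # reduction layer
        self.classify = nn.Linear(channels, num_classes)   # classification layer

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.backbone(clip)
        for block in self.r_fsrh:
            x = x + block(x)                               # residual connection around each FSRH
        return self.classify(self.reduce(x).flatten(1))


if __name__ == "__main__":
    logits = DNN1()(torch.randn(1, 3, 16, 112, 112))       # (batch, RGB, frames, H, W)
    print(logits.shape)                                    # torch.Size([1, 24])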