Shuffle-invariant Network for Action Recognition in Videos
Author: | Qinghongya Shi, Hong-Bo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, Jing-Hua Liu |
---|---|
Year of publication: | 2022 |
Subject: | |
Source: | ACM Transactions on Multimedia Computing, Communications, and Applications. 18:1-18 |
ISSN: | 1551-6865 1551-6857 |
DOI: | 10.1145/3485665 |
Description: | Local key features in video are important for improving the accuracy of human action recognition. However, most end-to-end methods focus on global feature learning from videos, while few works consider enhancing the local information in a feature. In this article, we discuss how to automatically enhance the ability to discriminate the local information in an action feature and thereby improve the accuracy of action recognition. To address this problem, we assume that the critical level of each region for the action recognition task is different and does not change when region locations are shuffled. We therefore propose a novel action recognition method called the shuffle-invariant network. In the proposed method, the shuffled video is generated by regular region cutting and random confusion to enhance the input data. The proposed network adopts a multitask framework, which includes one feature backbone network and three task branches: local critical feature shuffle-invariant learning, adversarial learning, and an action classification network. To enhance the local features, the feature response of each region is predicted by a local critical feature learning network. To train this network, an L1-based critical feature shuffle-invariant loss is defined to ensure that the ordered feature response list of these regions remains unchanged after the region location shuffle. Then, adversarial learning is applied to eliminate the noise caused by the region shuffle. Finally, the action classification network combines these two tasks to jointly guide the training of the feature backbone network and obtain more effective action features. In the testing phase, only the action classification network is applied to identify the action category of the input video. We verify the proposed method on the HMDB51 and UCF101 action datasets. Several ablation experiments are conducted to verify the effectiveness of each module. The experimental results show that our approach achieves state-of-the-art performance. |
Database: | OpenAIRE |
External link: |
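
The region shuffle and the L1-based shuffle-invariant loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the grid size, how the permutation is shared across frames, or the exact form of the loss, so the function names (`shuffle_regions`, `shuffle_invariant_l1`), the single-frame setting, and the mean-absolute-difference form used here are all assumptions chosen for illustration.

```python
import random


def shuffle_regions(frame, grid, rng):
    """Cut a 2D frame (list of rows) into grid x grid blocks and permute them.

    Hypothetical helper: assumes the frame height and width are divisible
    by `grid`. Returns the shuffled frame and the permutation `perm`, where
    block slot k of the shuffled frame holds original block perm[k].
    """
    h, w = len(frame), len(frame[0])
    bh, bw = h // grid, w // grid
    perm = list(range(grid * grid))
    rng.shuffle(perm)
    out = [[None] * w for _ in range(h)]
    for k, src in enumerate(perm):
        dr, dc = divmod(k, grid)    # destination block row and column
        sr, sc = divmod(src, grid)  # source block row and column
        for i in range(bh):
            for j in range(bw):
                out[dr * bh + i][dc * bw + j] = frame[sr * bh + i][sc * bw + j]
    return out, perm


def shuffle_invariant_l1(resp_orig, resp_shuf, perm):
    """Assumed form of the loss: mean absolute difference between per-region
    responses, after mapping each shuffled slot back to its source region."""
    total = sum(abs(resp_shuf[k] - resp_orig[src]) for k, src in enumerate(perm))
    return total / len(perm)


# Usage: a 4x4 toy frame cut into a 2x2 grid of regions.
rng = random.Random(0)
frame = [[4 * r + c for c in range(4)] for r in range(4)]
shuffled, perm = shuffle_regions(frame, 2, rng)

# Made-up per-region responses; a perfectly shuffle-invariant predictor
# would give each region the same response wherever it ends up.
resp_orig = [0.1, 0.9, 0.3, 0.5]
resp_shuf = [resp_orig[src] for src in perm]
loss = shuffle_invariant_l1(resp_orig, resp_shuf, perm)
```

In this toy setting the loss is zero exactly when every region receives the same response before and after the shuffle, which is one plausible reading of the invariance the abstract describes. In the paper the responses come from a learned network operating on video features rather than from fixed numbers.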