Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention
Author: | Fazliddin Anvarov, Dae Ha Kim, Byung Cheol Song |
Year of publication: | 2020 |
Subject: | Computer Networks and Communications; Computer science; Electronics; Convolutional neural network; 3D CNN; action recognition; feature aggregation; deep feature attention; Pattern recognition; Artificial intelligence; Electrical and Electronic Engineering; Hardware and Architecture; Control and Systems Engineering; Signal Processing |
Source: | Electronics, Vol. 9, Iss. 1, p. 147 (2020) |
ISSN: | 2079-9292 |
Description: | Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition, which mainly uses a two-dimensional (2D) convolutional neural network (CNN), video-based action recognition must characterize both short-term small movements and long-term temporal appearance information. Previous methods analyze video action behavior using only a basic 3D CNN framework. However, these approaches are limited in analyzing fast action movements or abruptly appearing objects because of the limited receptive field of the convolutional filters. In this paper, we propose aggregating squeeze-and-excitation (SE) and self-attention (SA) modules with a 3D CNN to analyze both short- and long-term temporal action behavior efficiently. We implemented the SE and SA modules to present a novel approach to video action recognition that builds upon current state-of-the-art methods and demonstrates better performance on the UCF-101 and HMDB51 datasets. For example, with the 3D ResNeXt-101 architecture we achieve accuracies of 92.5% (16-frame clips) and 95.6% (64-frame clips) on UCF-101, and 68.1% (16-frame clips) and 74.1% (64-frame clips) on HMDB51. |
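The description above combines channel-wise (SE) and spatio-temporal (SA) attention with 3D CNN features. As a rough illustration only, and not the authors' implementation, the following is a minimal PyTorch-style sketch of an SE block adapted to 3D feature maps; the class name, reduction ratio `r`, and placement within the network are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock3D(nn.Module):
    """Channel attention for 3D CNN features: squeeze over (T, H, W), excite channels.

    Hypothetical sketch; not the paper's code. `r` is an assumed reduction ratio.
    """
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # global average over (T, H, W)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),  # per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        # Squeeze to (B, C), compute gates, reshape for broadcasting
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w  # reweight each channel of the feature map

# Example: 3D CNN features for 16-frame clips, shape (B, C, T, H, W)
feats = torch.randn(2, 64, 16, 28, 28)
out = SEBlock3D(64)(feats)
print(out.shape)  # torch.Size([2, 64, 16, 28, 28])
```

Because the gating is computed from a global spatio-temporal average, such a block can emphasize channels tied to fast movements that a single convolutional filter's receptive field would miss, which matches the motivation stated in the description.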
Database: | OpenAIRE |
External link: |