Action Recognition Using Deep 3D CNNs with Sequential Feature Aggregation and Attention

Authors:	Fazliddin Anvarov, Dae Ha Kim, Byung Cheol Song
Publication year:	2020
Subject:	Computer Networks and Communications; Computer science; lcsh:TK7800-8360; 02 engineering and technology; Convolutional neural network; Field (computer science); Image (mathematics); 0202 electrical engineering, electronic engineering, information engineering; Electrical and Electronic Engineering; action recognition; Feature aggregation; business.industry; lcsh:Electronics; 020206 networking & telecommunications; Pattern recognition; 3D CNN; Action (philosophy); Hardware and Architecture; Control and Systems Engineering; Filter (video); Signal Processing; 020201 artificial intelligence & image processing; Artificial intelligence; business; deep feature attention
Source:	Electronics, Vol. 9, Iss. 1, p. 147 (2020)
ISSN:	2079-9292
Description:	Action recognition is an active research field that aims to recognize human actions and intentions from a series of observations of human behavior and the environment. Unlike image-based action recognition, which mainly uses a two-dimensional (2D) convolutional neural network (CNN), video-based action recognition must characterize both short-term small movements and long-term temporal appearance information. Previous methods analyze video action behavior using only a basic 3D CNN framework. However, these approaches are limited in analyzing fast action movements or abruptly appearing objects because of the limited coverage of the convolutional filter. In this paper, we propose the aggregation of squeeze-and-excitation (SE) and self-attention (SA) modules with a 3D CNN to analyze both short-term and long-term temporal action behavior efficiently. We implemented the SE and SA modules to present a novel approach to video action recognition that builds upon current state-of-the-art methods and demonstrates better performance on the UCF-101 and HMDB51 datasets. For example, with the ResNeXt-101 architecture in a 3D CNN, we obtain accuracies of 92.5% (16f-clip) and 95.6% (64f-clip) on UCF-101, and 68.1% (16f-clip) and 74.1% (64f-clip) on HMDB51.
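Note:	The two module types named in the description can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the bias-free weights, the reduction ratio, the spatial pooling before attention, and the residual connection are all assumptions chosen for illustration. It only shows the general idea of channel reweighting (SE) and attention across frames (SA) on a 3D CNN feature map of shape (N, C, T, H, W).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def se_block_3d(x, w1, w2):
    """Squeeze-and-excitation on a 5D feature map (hypothetical helper).

    x:  (N, C, T, H, W) features from a 3D convolution
    w1: (C, C // r) bottleneck weights, w2: (C // r, C) expansion weights
    Biases are omitted here for brevity (an assumption of this sketch).
    """
    # Squeeze: global average over time and space -> (N, C)
    z = x.mean(axis=(2, 3, 4))
    # Excitation: ReLU bottleneck followed by a sigmoid gate -> (N, C)
    s = sigmoid(np.maximum(z @ w1, 0.0) @ w2)
    # Scale: reweight every channel by its gate value
    return x * s[:, :, None, None, None]


def temporal_self_attention(x, wq, wk, wv):
    """Scaled dot-product attention across the T frames (hypothetical helper).

    Features are first averaged over H and W (an assumption), so each frame
    is one C-dimensional token. wq, wk: (C, d); wv: (C, C).
    Returns the residual-updated features and the (N, T, T) attention map.
    """
    f = x.mean(axis=(3, 4)).transpose(0, 2, 1)            # (N, T, C)
    q, k, v = f @ wq, f @ wk, f @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)        # softmax over frames
    out = (attn @ v).transpose(0, 2, 1)                   # (N, C, T)
    # Residual connection, broadcast over the spatial dimensions
    return x + out[:, :, :, None, None], attn


# Example usage with random features: 2 clips, 8 channels, 4 frames, 6x6 maps
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 6, 6))
y = se_block_3d(x, rng.standard_normal((8, 2)), rng.standard_normal((2, 8)))
y, attn = temporal_self_attention(
    y,
    rng.standard_normal((8, 4)),
    rng.standard_normal((8, 4)),
    rng.standard_normal((8, 8)),
)
```

In this sketch the SE gate depends only on per-channel global statistics, while the attention map lets every frame draw on every other frame, which is one way to reach beyond the local coverage of a convolutional filter.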
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::0b0a7cb6f20e193abaccde49c0513635 https://doi.org/10.3390/electronics9010147