Semi-CNN Architecture for Effective Spatio-Temporal Learning in Action Recognition

Author: Mei Chee Leong, Dilip K. Prasad, Feng Lin, Yong Tsui Lee
Contributors: School of Mechanical and Aerospace Engineering, Interdisciplinary Graduate School (IGS), School of Computer Science and Engineering, Institute for Media Innovation (IMI)
Language: English
Year of publication: 2019
Subject:
computer science
action recognition
spatio-temporal features
convolution network
convolutional neural network
transfer learning
overfitting
artificial intelligence
pattern recognition
architecture
VDP::Technology: 500::Information and communication technology: 550
Source: Applied Sciences, Vol. 10, Iss. 2, p. 557 (2020)
Description: This paper introduces a fusion convolutional architecture for efficient learning of spatio-temporal features in video action recognition. Unlike 2D convolutional neural networks (CNNs), 3D CNNs can be applied directly to consecutive frames to extract spatio-temporal features. The aim of this work is to fuse the convolution layers from 2D and 3D CNNs to allow temporal encoding with fewer parameters than 3D CNNs. We adopt transfer learning from pre-trained 2D CNNs for spatial feature extraction, followed by temporal encoding, before connecting to 3D convolution layers at the top of the architecture. We construct our fusion architecture, semi-CNN, based on three popular models: VGG-16, ResNets and DenseNets, and compare its performance with that of the corresponding 3D models. Our empirical results on the action recognition dataset UCF-101 demonstrate that our fusion of 1D, 2D and 3D convolutions outperforms the 3D model of the same depth, with fewer parameters and reduced overfitting. Our semi-CNN architecture achieved an average 16–30% boost in top-1 accuracy when evaluated on an input video of 16 frames.
Database: OpenAIRE
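
The layering described in the abstract (pre-trained 2D convolutions for spatial extraction, a temporal-encoding stage, then 3D convolutions at the top) can be sketched in PyTorch. The following is an illustrative sketch only, not the paper's published configuration: the class name SemiCNN, the split after VGG-16's third convolution block, the 3x1x1 temporal convolution, the 3D block widths, and the 112x112 input size are all assumptions made to show the 2D-to-1D-to-3D fusion pattern.

```python
# Minimal sketch of a 2D -> 1D -> 3D fusion network for a 16-frame clip.
# All layer choices below are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights


class SemiCNN(nn.Module):  # hypothetical name for illustration
    def __init__(self, num_classes: int = 101):
        super().__init__()
        # Transfer learning: reuse the first three conv blocks of a
        # pre-trained VGG-16 (through pool3; 256 output channels) as the
        # per-frame spatial feature extractor.
        self.spatial_2d = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:17]
        # Temporal encoding: a 1D convolution along the frame axis, applied
        # at every spatial location (implemented as a 3x1x1 3D convolution).
        self.temporal_1d = nn.Conv3d(256, 256, kernel_size=(3, 1, 1),
                                     padding=(1, 0, 0))
        # Top of the architecture: full 3D convolutions for joint
        # spatio-temporal features.
        self.spatio_temporal_3d = nn.Sequential(
            nn.Conv3d(256, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels=3, frames, height, width)
        b, c, t, h, w = clip.shape
        # Fold frames into the batch so the 2D layers see ordinary images.
        x = clip.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        x = self.spatial_2d(x)                      # (b*t, 256, h', w')
        _, c2, h2, w2 = x.shape
        # Unfold back into a 5D volume for the temporal and 3D stages.
        x = x.reshape(b, t, c2, h2, w2).permute(0, 2, 1, 3, 4)
        x = self.temporal_1d(x)                     # temporal encoding
        x = self.spatio_temporal_3d(x).flatten(1)   # (b, 512)
        return self.classifier(x)


# Example: a 16-frame RGB clip, matching the evaluation setting above.
if __name__ == "__main__":
    model = SemiCNN()
    logits = model(torch.randn(2, 3, 16, 112, 112))
    print(logits.shape)  # torch.Size([2, 101])
```

The design intent this sketch tries to convey is the abstract's parameter argument: because the early layers reuse pre-trained 2D kernels rather than full 3D kernels, the network carries fewer parameters than a 3D CNN of the same depth, which is consistent with the reduced overfitting the authors report.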