Convolutional Networks With Channel and STIPs Attention Model for Action Recognition in Videos
Author: Yibin Li, Xin Ma, Hanbo Wu
Year of publication: 2020
Subject: Computer science, feature extraction, pooling, image processing and computer vision, pattern recognition, convolutional neural networks, discriminative models, signal processing, media technology, feature (machine learning), RGB color model, artificial intelligence, electrical and electronic engineering, focus (optics), communication channel
Source: IEEE Transactions on Multimedia, 22:2293-2306
ISSN: 1941-0077, 1520-9210
DOI: 10.1109/tmm.2019.2953814
Description: With the help of convolutional neural networks (CNNs), video-based human action recognition has made significant progress. Spatial and channel-wise CNN features provide rich information for powerful image description. However, CNNs cannot model the long-term temporal dependencies of an entire video, nor can they focus well on the informative motion regions of actions. To address these two problems, we propose a novel video-based action recognition framework. We first represent videos as dynamic image sequences (DISs), which describe videos effectively by modeling local spatial-temporal dynamics and dependencies. We then propose a channel and spatial-temporal interest points (STIPs) attention model (CSAM), built on CNNs, that focuses on the discriminative channels of the network and on the informative spatial motion regions of human actions. Specifically, channel attention (CA) is implemented by automatically learning channel-wise convolutional features and assigning different weights to different channels. STIPs attention (SA) is encoded by projecting the STIPs detected on frames of the dynamic image sequences into the corresponding convolutional feature-map space. The proposed CSAM is embedded after the CNN convolutional layers to refine the feature maps, followed by global average pooling to produce effective frame-level video representations. Finally, the frame-level representations are fed into an LSTM to capture temporal dependencies and perform classification. Experiments on three challenging RGB-D datasets show that our method outperforms state-of-the-art approaches while using only depth data. (An illustrative sketch of the attention block follows this record.)
Database: OpenAIRE
External link:
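
The description above outlines a block that applies channel attention and STIP-based spatial attention to CNN feature maps, then global-average-pools them into frame-level vectors. Below is a minimal sketch of how such a block might look in PyTorch, assuming an SE-style channel attention with a reduction ratio of 16 and a binary spatial mask scattered from STIP pixel coordinates; the names (`CSAMSketch`, `ChannelAttention`, `stips_to_mask`) and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of channel + STIP-based spatial attention on CNN feature maps.
# Assumptions (not from the paper): SE-style channel attention, reduction ratio 16,
# and a binary spatial mask built by scattering STIP coordinates onto the feature grid.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style per-channel reweighting (assumed CA variant)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W) -> per-channel weights in (0, 1)
        w = self.fc(x.mean(dim=(2, 3)))          # global average pool over H, W
        return x * w.view(x.size(0), -1, 1, 1)   # reweight channels


def stips_to_mask(stips, frame_hw, fmap_hw, device):
    """Project detected STIP (x, y) pixel coordinates onto the feature-map grid.

    `stips` is one (K, 2) coordinate tensor per sample; the detector and the
    exact projection used in the paper may differ from this simple rescaling.
    """
    H, W = fmap_hw
    fh, fw = frame_hw
    masks = torch.zeros(len(stips), 1, H, W, device=device)
    for i, pts in enumerate(stips):
        if pts.numel() == 0:
            masks[i] += 1.0  # no interest points: fall back to uniform attention
            continue
        ys = (pts[:, 1].float() * H / fh).long().clamp(0, H - 1)
        xs = (pts[:, 0].float() * W / fw).long().clamp(0, W - 1)
        masks[i, 0, ys, xs] = 1.0
    return masks


class CSAMSketch(nn.Module):
    """Channel attention, then STIP-based spatial emphasis, then global average pooling."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)

    def forward(self, fmap, stips, frame_hw):
        x = self.ca(fmap)                                    # channel attention
        mask = stips_to_mask(stips, frame_hw, fmap.shape[-2:], fmap.device)
        x = x * (1.0 + mask)                                 # emphasize STIP regions without zeroing the rest
        return x.mean(dim=(2, 3))                            # frame-level feature vector


if __name__ == "__main__":
    fmap = torch.randn(2, 512, 7, 7)                         # conv features of 2 frames
    stips = [torch.tensor([[100, 60], [30, 200]]), torch.empty(0, 2)]
    feats = CSAMSketch(512)(fmap, stips, frame_hw=(224, 224))
    print(feats.shape)                                       # torch.Size([2, 512])
```

Frame-level vectors produced this way would then be stacked over time and passed to an LSTM classifier, as the description indicates.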