Fast and Accurate Action Detection in Videos With Motion-Centric Attention Model
Author: | Wenmin Wang, Wen Gao, Jinzhuo Wang |
---|---|
Year of publication: | 2020 |
Subject: | Pixel; Computer science; Feature extraction; Image processing and computer vision; Visualization; Perception; Saccade; Fixation (visual); Artificial intelligence & image processing; Media Technology; Computer vision; Artificial intelligence; Electrical and Electronic Engineering; Classifier (UML) |
Source: | IEEE Transactions on Circuits and Systems for Video Technology, 30:117–130 |
ISSN: | 1558-2205; 1051-8215 |
Description: | A key factor that distinguishes action detection in videos from general video classification is the availability of human-guided cues, especially motion signals. Since not all pixels in a video are informative for action recognition, the irrelevant and redundant parts introduce considerable noise and burden both feature extraction and classifier training. This motivates the design of attentive models that can dynamically focus computation on the key spatiotemporal volumes. In this paper, we propose a motion-centric attention model for action detection in videos that imitates the saccade and fixation procedures of human perception when detecting actions in a video. Specifically, we first present a strategy for generating motion-centric locations based on the density peaks of motion signals, providing reliable candidates around which actions are likely to occur (a toy sketch of this step follows the record below). Then, we introduce an attention model that conducts the saccade and fixation procedures on these candidates to observe local spatiotemporal visual information, preserve an internal comprehension, and produce action proposals with temporal bounds. Afterward, a classifier with several variants classifies the action proposals and decides which one to fixate on, generating the final predictions. We show how to train our model efficiently for fast and accurate action detection by scanning only a small fraction of the locations in a video. Extensive experiments on three challenging datasets show promising results in both accuracy and speed. |
Database: | OpenAIRE |
External link: |
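
For a concrete picture of the candidate-generation step described above, the following is a minimal Python sketch, not the authors' implementation, of proposing locations at peaks of a motion signal. The frame-differencing motion estimate, the patch pooling, the top-k selection, and the function names `motion_magnitude` and `motion_centric_candidates` are all illustrative assumptions; the paper's actual density-peak procedure may differ.

```python
# A toy sketch (assumed, not the paper's code) of motion-centric candidate
# generation: score coarse spatiotemporal cells by motion energy and keep
# the most motion-dominant ones as candidate locations.
import numpy as np

def motion_magnitude(frames: np.ndarray) -> np.ndarray:
    """Per-pixel motion energy from absolute frame differences.

    frames: (T, H, W) grayscale video as floats.
    Returns (T-1, H, W) motion magnitudes. A real system would likely
    use optical flow here instead of simple frame differencing.
    """
    return np.abs(np.diff(frames, axis=0))

def motion_centric_candidates(frames: np.ndarray, patch: int = 16,
                              top_k: int = 5):
    """Return (frame, y, x) tuples at motion-dominant cells.

    The video is pooled into patch-sized cells; a simple top-k rule
    stands in for the paper's density-peak selection.
    """
    mag = motion_magnitude(frames)
    T, H, W = mag.shape
    gh, gw = H // patch, W // patch
    # Sum motion inside each patch to get a coarse density map per frame.
    density = mag[:, :gh * patch, :gw * patch] \
        .reshape(T, gh, patch, gw, patch).sum(axis=(2, 4))
    # Rank cells by motion density and keep the top-k as candidates.
    order = np.argsort(density.reshape(-1))[::-1][:top_k]
    return [(int(i // (gh * gw)),                              # frame index
             int((i % (gh * gw)) // gw * patch + patch // 2),  # y center
             int((i % gw) * patch + patch // 2))               # x center
            for i in order]

# Usage on a random toy clip:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.random((8, 64, 64))
    print(motion_centric_candidates(clip))
```

Under these assumptions, a downstream attention model would only visit the handful of returned locations rather than scanning the whole video, which is the intuition behind the paper's reported speedup.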