Abstract | ||
---|---|---|
Learning spatiotemporal features via 3D-CNN (3D Convolutional Neural Network) models has been regarded as an effective approach for action recognition. In this paper, we explore visual attention mechanisms for video analysis and propose a novel 3D-CNN model, dubbed AE-I3D (Attention-Enhanced Inflated-3D Network), for learning attention-enhanced spatiotemporal representations. The contribution of our AE-I3D is threefold. First, we inflate soft attention to the spatiotemporal scope of 3D videos, and adopt softmax to generate a probability distribution of attentional features in a feedforward 3D-CNN architecture. Second, we devise an AE-Res (Attention-Enhanced Residual learning) module, which learns attention-enhanced features in a two-branch residual learning way; the AE-Res module is also lightweight and flexible, so it can be easily embedded into many 3D-CNN architectures. Finally, we embed multiple AE-Res modules into an I3D (Inflated-3D) network, yielding our AE-I3D model, which can be trained in an end-to-end, video-level manner. Unlike previous attention networks, our method inflates residual attention from 2D images to 3D videos, so that 3D attention residual learning enhances the spatiotemporal representation. We use RGB-only video data for evaluation on three benchmarks: UCF101, HMDB51, and Kinetics. The experimental results demonstrate that our AE-I3D is effective with competitive performance. |
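
The abstract describes softmax-normalized soft attention over the spatiotemporal volume, combined with the features in a two-branch residual fashion. A minimal NumPy sketch of that idea follows. It is not the authors' implementation: the function names `spatiotemporal_softmax` and `attention_residual` are hypothetical, and both the per-channel softmax over all T×H×W positions and the `features + attention * features` combination are assumptions, since the abstract does not give the exact formulation. Random scores stand in for the output of a learned attention branch.

```python
import numpy as np


def spatiotemporal_softmax(scores):
    """Softmax over all T*H*W positions, computed independently per channel.

    scores: array of shape (C, T, H, W).
    Assumption: the normalization axis is not specified in the abstract.
    """
    channels = scores.shape[0]
    flat = scores.reshape(channels, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(flat)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights.reshape(scores.shape)


def attention_residual(features, scores):
    """Two-branch combination: identity branch plus attention-weighted branch.

    Assumption: the residual form x + A * x is illustrative only.
    """
    attention = spatiotemporal_softmax(scores)
    return features + attention * features


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 7, 7))  # (C, T, H, W) feature volume
scores = rng.standard_normal((4, 8, 7, 7))    # stand-in for a learned attention branch
out = attention_residual(features, scores)
```

Because the identity branch passes the features through unchanged, the attention branch in this sketch can only re-weight them and cannot suppress them entirely, which is the usual motivation for a residual form of attention.
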
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/ACCESS.2020.2968024 | IEEE ACCESS |
Keywords | DocType | Volume |
---|---|---|
Action recognition, video understanding, spatiotemporal representation, visual attention, 3D-CNN, residual learning | Journal | 8 |
ISSN | Citations | PageRank |
---|---|---|
2169-3536 | 1 | 0.35 |
References | Authors |
---|---|
0 | 7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Zhensheng Shi | 1 | 6 | 2.47 |
Liangjie Cao | 2 | 1 | 0.35 |
Cheng Guan | 3 | 1 | 0.69 |
Haiyong Zheng | 4 | 20 | 8.12 |
Zhaorui Gu | 5 | 12 | 3.94 |
Zhibin Yu | 6 | 40 | 9.99 |
Bing Zheng | 7 | 75 | 23.58 |