Abstract |
---|
Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D-convolution-based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Specifically, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. In particular, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset. |
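The super-image construction described in the abstract can be sketched as a simple reshape: N successive RGB frames are stacked along the channel axis to form one 3N-channel image. The helper below, `make_super_images`, is a hypothetical illustration (not code from the paper), assuming frames arrive as a NumPy array of shape (T, 3, H, W) with T divisible by N.

```python
import numpy as np

def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Stack every n successive RGB frames into one 3n-channel super-image.

    frames: array of shape (T, 3, H, W), T divisible by n.
    Returns an array of shape (T // n, 3 * n, H, W), on which ordinary
    2D convolutions can then capture local spatial-temporal structure.
    """
    t, c, h, w = frames.shape
    assert c == 3 and t % n == 0, "expect RGB frames and T divisible by n"
    # Contiguous reshape concatenates the channels of n consecutive frames.
    return frames.reshape(t // n, n * c, h, w)

# Example: 12 frames of 8x8 RGB grouped into super-images of 4 frames each.
video = np.zeros((12, 3, 8, 8), dtype=np.float32)
supers = make_super_images(video, 4)
print(supers.shape)  # (3, 12, 8, 8)
```

Because the stacking is just a view-preserving reshape, it adds no parameters; all temporal mixing within each group of N frames is left to the 2D convolutions applied on the super-images.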
Year | Venue | Field |
---|---|---|
2018 | National Conference on Artificial Intelligence | Architecture, Pattern recognition, Convolution, Computer science, Action recognition, Network architecture, Communication channel, Artificial intelligence, Temporal modeling, Deep learning, Model complexity |

DocType | Volume | Citations |
---|---|---|
Journal | abs/1811.01549 | 2 |

PageRank | References | Authors |
---|---|---|
0.37 | 16 | 8 |

Name | Order | Citations | PageRank |
---|---|---|---|
He, D. | 1 | 33 | 13.67 |
Zhichao Zhou | 2 | 3 | 1.40 |
Chuang Gan | 3 | 253 | 31.92 |
Fu Li | 4 | 3 | 2.42 |
Xiao Liu | 5 | 284 | 41.90 |
Yandong Li | 6 | 35 | 5.80 |
LiMin Wang | 7 | 816 | 48.41 |
Shilei Wen | 8 | 79 | 13.59 |