Abstract |
---|
Despite the success of deep learning for static image understanding, it remains unclear which network architectures are most effective for spatial-temporal modeling in videos. In this paper, in contrast to existing CNN+RNN or pure 3D-convolution-based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global modeling in videos. Specifically, StNet stacks N successive video frames into a super-image with 3N channels and applies 2D convolution on super-images to capture local spatial-temporal relationships. To model global spatial-temporal structure, we apply temporal convolution on the local spatial-temporal feature maps. In particular, a novel temporal Xception block is proposed in StNet, which employs separate channel-wise and temporal-wise convolutions over the feature sequence of a video. Extensive experiments on the Kinetics dataset demonstrate that our framework outperforms several state-of-the-art approaches in action recognition and strikes a satisfying trade-off between recognition accuracy and model complexity. We further demonstrate the generalization performance of the learned video representations on the UCF101 dataset. |
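The super-image construction described in the abstract can be sketched as a simple reshape: N successive RGB frames are stacked along the channel axis to form one 3N-channel image. The helper below, `make_super_images`, is a hypothetical illustration (not code from the paper), assuming frames arrive as a NumPy array of shape (T, 3, H, W) with T divisible by N.

```python
import numpy as np

def make_super_images(frames: np.ndarray, n: int) -> np.ndarray:
    """Stack every n successive RGB frames into one 3n-channel super-image.

    frames: array of shape (T, 3, H, W), T divisible by n.
    Returns an array of shape (T // n, 3 * n, H, W), on which ordinary
    2D convolutions can then capture local spatial-temporal structure.
    """
    t, c, h, w = frames.shape
    assert c == 3 and t % n == 0, "expect RGB frames and T divisible by n"
    # Contiguous reshape concatenates the channels of n consecutive frames.
    return frames.reshape(t // n, n * c, h, w)

# Example: 12 frames of 8x8 RGB grouped into super-images of 4 frames each.
video = np.zeros((12, 3, 8, 8), dtype=np.float32)
supers = make_super_images(video, 4)
print(supers.shape)  # (3, 12, 8, 8)
```

Because the stacking is just a view-preserving reshape, it adds no parameters; all temporal mixing within each group of N frames is left to the 2D convolutions applied on the super-images.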
Year | Venue | Field |
---|---|---|
2018 | National Conference on Artificial Intelligence | Architecture, Pattern recognition, Convolution, Computer science, Action recognition, Network architecture, Communication channel, Artificial intelligence, Temporal modeling, Deep learning, Model complexity |

DocType | Volume | Citations |
---|---|---|
Journal | abs/1811.01549 | 2 |

PageRank | References | Authors |
---|---|---|
0.37 | 16 | 8 |

Name | Order | Citations | PageRank |
---|---|---|---|
He, D. | 1 | 33 | 13.67 |
Zhichao Zhou | 2 | 3 | 1.40 |
Chuang Gan | 3 | 253 | 31.92 |
Fu Li | 4 | 3 | 2.42 |
Xiao Liu | 5 | 284 | 41.90 |
Yandong Li | 6 | 35 | 5.80 |
LiMin Wang | 7 | 816 | 48.41 |
Shilei Wen | 8 | 79 | 13.59 |