Title
Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition.
Abstract
In this report, we describe in detail our approach to the ActivityNet 2018 Kinetics-600 challenge. Although existing state-of-the-art methods for this task adopt spatial-temporal modelling via either end-to-end frameworks such as I3D [i3d] or two-stage frameworks (i.e., CNN+RNN), video modelling is far from solved. For this challenge, we propose the spatial-temporal network (StNet) for better joint spatial-temporal modelling and more comprehensive video understanding. In addition, since video sources contain multi-modal information, we integrate both early-fusion and late-fusion strategies for multi-modal information via our proposed improved temporal Xception network (iTXN). Our StNet RGB single model achieves 78.99% top-1 precision on the Kinetics-600 validation set, and our improved temporal Xception network, which integrates the RGB, flow, and audio modalities, reaches 82.35%. After model ensembling, we achieve top-1 precision as high as 85.0% on the validation set and rank No. 1 among all submissions.
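The late-fusion idea mentioned in the abstract (combining per-modality predictions into one score) can be sketched as follows. This is a minimal illustration, not the paper's actual iTXN implementation; the `late_fuse` helper, the modality names, and the uniform weighting are all assumptions for the example.

```python
import math

def softmax(logits):
    """Convert raw classifier logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fuse(modality_logits, weights=None):
    """Weighted average of per-modality class probabilities (late fusion).

    modality_logits: dict mapping modality name -> list of class logits.
    weights: optional dict of per-modality weights (defaults to uniform).
    """
    names = list(modality_logits)
    if weights is None:
        weights = {n: 1.0 / len(names) for n in names}
    probs = {n: softmax(modality_logits[n]) for n in names}
    num_classes = len(next(iter(probs.values())))
    return [sum(weights[n] * probs[n][c] for n in names)
            for c in range(num_classes)]

# Hypothetical 4-class logits for one clip from three modality streams.
scores = {
    "rgb":   [2.0, 0.5, 0.1, -1.0],
    "flow":  [1.5, 1.0, 0.0, -0.5],
    "audio": [0.2, 0.1, 0.3, 0.0],
}
fused = late_fuse(scores)
prediction = max(range(len(fused)), key=fused.__getitem__)
```

Early fusion, by contrast, would concatenate or merge the modality features before classification; the paper's iTXN combines both strategies.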
Year
2018
Venue
arXiv: Computer Vision and Pattern Recognition
Field
Modalities, Pattern recognition, Computer science, Action recognition, Fusion, RGB color model, Artificial intelligence, Machine learning, Modal
DocType
Volume
abs/1806.10319
Citations
1
Journal
PageRank
0.42
References
0
Authors
6
Name          Order  Citations  PageRank
He, D.        1      33         13.67
Fu Li         2      98         19.30
Qijie Zhao    3      11         3.30
Xiang Long    4      30         10.70
Yi Fu         5      5          1.53
Shilei Wen    6      79         13.59