Title
Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification
Abstract
Video classification is highly important and has widespread applications, such as video search and intelligent surveillance. Video naturally contains both static and motion information, which can be represented by frames and optical flow, respectively. Recently, researchers have generally adopted deep networks to capture the static and motion information separately, which has two main limitations. First, the coexistence relationship between spatial and temporal attention is ignored, although they should be jointly modeled as the spatial and temporal evolutions of video to learn discriminative video features. Second, the strong complementarity between static and motion information is ignored, although they should be collaboratively learned to enhance each other. To address the above two limitations, this paper proposes the two-stream collaborative learning with spatial-temporal attention (TCLSTA) approach, which consists of two models. First, for the spatial-temporal attention model, the spatial-level attention emphasizes the salient regions in a frame, and the temporal-level attention exploits the discriminative frames in a video. They are mutually enhanced to jointly learn the discriminative static and motion features for better classification performance. Second, the static-motion collaborative model not only achieves mutual guidance between static and motion information to enhance the feature learning but also adaptively learns the fusion weights of static and motion streams, thus exploiting the strong complementarity between static and motion information to improve video classification. Experiments on four widely used data sets show that our TCLSTA approach achieves the best performance compared with more than 10 state-of-the-art methods.
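As a rough illustration of the mechanisms the abstract describes, the following NumPy sketch (not the authors' implementation; the dimensions, feature maps, attention parameters, and fusion weights are all placeholder values) applies spatial attention over the regions of each frame, temporal attention over the frames of a video, and an adaptively weighted fusion of the static (frame) and motion (optical-flow) streams.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def spatial_attention(frame_maps, w_s):
        # frame_maps: (T, H, W, C) convolutional feature maps of T frames.
        # w_s: (C,) scoring vector, standing in for a learned projection.
        T, H, W, C = frame_maps.shape
        flat = frame_maps.reshape(T, H * W, C)           # (T, HW, C)
        alpha = softmax(flat @ w_s, axis=1)              # attention over regions
        return (alpha[..., None] * flat).sum(axis=1)     # (T, C) attended frame features

    def temporal_attention(frame_feats, w_t):
        # frame_feats: (T, C) per-frame features; w_t: (C,) scoring vector.
        beta = softmax(frame_feats @ w_t)                # attention over frames
        return beta @ frame_feats                        # (C,) video-level feature

    rng = np.random.default_rng(0)
    T, H, W, C = 8, 7, 7, 64                             # toy dimensions
    static_maps = rng.normal(size=(T, H, W, C))          # frame (RGB) stream features
    motion_maps = rng.normal(size=(T, H, W, C))          # optical-flow stream features

    w_s, w_t = rng.normal(size=C), rng.normal(size=C)    # placeholder attention parameters
    v_static = temporal_attention(spatial_attention(static_maps, w_s), w_t)
    v_motion = temporal_attention(spatial_attention(motion_maps, w_s), w_t)

    # Adaptive stream fusion: placeholder learned scalars -> normalized fusion weights.
    lam = softmax(np.array([0.3, 0.1]))
    video_feature = lam[0] * v_static + lam[1] * v_motion
    print(video_feature.shape)                           # (64,)

In the paper's setting these attention and fusion weights would be learned end to end with the two deep streams; the sketch only shows how region-level, frame-level, and stream-level weighting compose into a single video representation.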
Year
2017
DOI
10.1109/TCSVT.2018.2808685
Venue
IEEE Transactions on Circuits and Systems for Video Technology
Keywords
Feature extraction, Adaptation models, Video sequences, Collaboration, Semantics, Collaborative work, Weapons
Field
Complementarity (molecular biology), Educational technology, Collaborative learning, Pattern recognition, Computer science, Collaborative model, Artificial intelligence, Discriminative model, Optical flow, Machine learning, Feature learning, Salient
DocType
Journal
Volume
abs/1711.03273
Issue
3
ISSN
1051-8215
Citations
15
PageRank
0.62
References
55
Authors
3
Name           Order  Citations  PageRank
Yuxin Peng     1      1122       74.90
Yunzhen Zhao   2      70         1.75
Junchao Zhang  3      24         1.74