Self-supervised Video Transformer - Citegraph

Paper Info

Title
Self-supervised Video Transformer

Abstract
In this paper, we propose self-supervised training for video transformers using unlabeled video data. From a given video, we create local and global spatiotemporal views with varying spatial sizes and frame rates. Our self-supervised objective seeks to match the features of these different views representing the same video, to be invariant to spatiotemporal variations in actions. To the best of our knowledge, the proposed approach is the first to alleviate the dependency on negative samples or dedicated memory banks in Self-supervised Video Transformer (SVT). Further, owing to the flexibility of Transformer models, SVT supports slow-fast video processing within a single architecture using dynamically adjusted positional encoding and supports long-term relationship modeling along spatiotemporal dimensions. Our approach performs well on four action recognition benchmarks (Kinetics-400, UCF-101, HMDB-51, and SSv2) and converges faster with small batch sizes. Code is available at: https://git.io/J1juJ.
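
The view construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `View` type, the function names, and every numeric setting (clip lengths, strides, crop sizes) are assumptions chosen for illustration. Each view is described only by the frame indices and the square crop window it would cover, so global and local views differ in both spatial size and frame rate.

```python
import random
from dataclasses import dataclass


@dataclass
class View:
    """Hypothetical description of one spatiotemporal view of a video."""
    frame_indices: list  # indices into the source video's frames
    top: int             # top edge of the square crop, in pixels
    left: int            # left edge of the square crop, in pixels
    size: int            # side length of the square crop, in pixels


def sample_view(num_frames, height, width, clip_len, stride, crop_size, rng):
    """Pick `clip_len` frames spaced `stride` apart plus a random square crop."""
    span = (clip_len - 1) * stride + 1  # frames covered from first to last
    if span > num_frames or crop_size > min(height, width):
        raise ValueError("view does not fit inside the video")
    start = rng.randint(0, num_frames - span)
    frames = [start + i * stride for i in range(clip_len)]
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    return View(frames, top, left, crop_size)


def sample_views(num_frames, height, width, seed=0):
    """Return (global_views, local_views); all settings below are assumed."""
    rng = random.Random(seed)
    # Global views: large crops, sampled at two different frame rates.
    global_views = [
        sample_view(num_frames, height, width, 8, stride, 224, rng)
        for stride in (2, 4)
    ]
    # Local views: small crops and short clips at several frame rates.
    local_views = [
        sample_view(num_frames, height, width, 4, stride, 96, rng)
        for stride in (1, 2, 4, 8)
    ]
    return global_views, local_views


if __name__ == "__main__":
    g, l = sample_views(num_frames=64, height=256, width=320)
    print(len(g), "global views,", len(l), "local views")
```

In a training loop, every view would be cropped from the same video and encoded, and the objective would pull the resulting features together. That matching step is omitted here because the abstract does not specify its form.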

Year	2022
DOI	10.1109/CVPR52688.2022.00289
Venue	IEEE Conference on Computer Vision and Pattern Recognition
Keywords	Video analysis and understanding, Action and event recognition, Self- & semi- & meta- & unsupervised learning
DocType	Conference
Volume	2022
Issue	1
Citations	0
PageRank	0.34
References	0
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Kanchana Ranasinghe	1	0	0.34
Muzammal Naseer	2	10	4.24
Salman Khan	3	387	41.05
Fahad Shahbaz Khan	4	1622	69.24
Michael Ryoo	5	0	0.34
