**Abstract**

Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While highly effective for learning holistic image and video representations, such an objective becomes suboptimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) to effectively learn spatio-temporally fine-grained video representations via self-supervision. We first design a region-based pretext task which requires the model to transform instance representations from one view to another, guided by context features. Further, we introduce a simple network design that successfully reconciles the simultaneous learning of both holistic and local representations. We evaluate the learned representations on a variety of downstream tasks and show that ConST-CL achieves competitive results on six datasets, including Kinetics, UCF, HMDB, AVA-Kinetics, AVA, and OTB. Our code and models will be available at https://github.com/tensorflow/models/tree/master/official/projects/const_cl.
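The pretext task described above — transforming region-level instance features from one view to another, guided by context, and contrasting the result against the target view's features — can be illustrated with a generic sketch. The attention-based `transform_with_context` helper and the InfoNCE-style loss below are simplifying assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Project features onto the unit sphere before computing similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def transform_with_context(regions, context):
    # Hypothetical single-head attention: each region feature from view 1
    # attends over context features from view 2 to predict its counterpart.
    logits = regions @ context.T / np.sqrt(regions.shape[-1])  # (N, M)
    logits -= logits.max(axis=1, keepdims=True)                # stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ context                                      # (N, D)

def info_nce(pred, target, temperature=0.1):
    # Generic InfoNCE loss: row i of `pred` should match row i of `target`,
    # with all other rows serving as negatives.
    pred, target = l2_normalize(pred), l2_normalize(target)
    logits = pred @ target.T / temperature                     # (N, N)
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When predicted and target rows match, the loss approaches zero; for unrelated rows it sits near log N, the chance level over N candidates.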
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/CVPR52688.2022.01359 | IEEE Conference on Computer Vision and Pattern Recognition |
Keywords | DocType | Volume
---|---|---
Video analysis and understanding, Action and event recognition, Representation learning, Self- & semi- & meta- & unsupervised learning | Conference | 2022

Issue | Citations | PageRank
---|---|---
1 | 0 | 0.34

References | Authors
---|---
0 | 8
Name | Order | Citations | PageRank |
---|---|---|---|
Liangzhe Yuan | 1 | 19 | 1.96 |
Rui Qian | 2 | 2 | 1.04 |
Yin Cui | 3 | 262 | 11.30 |
Boqing Gong | 4 | 685 | 33.29 |
Florian Schroff | 5 | 757 | 32.72 |
Ming-Hsuan Yang | 6 | 15303 | 620.69
Hartwig Adam | 7 | 1326 | 42.50 |
Ting Liu | 8 | 30 | 4.08 |