Title
MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning
Abstract
Contrastive self-supervised learning (CSL) has substantially advanced visual representation learning. However, existing video CSL methods mainly focus on clip-level temporal semantic consistency, and the temporal and spatial semantic correspondence across different granularities, i.e., the video, clip, and frame levels, is typically overlooked. To tackle this issue, we propose a self-supervised Macro-to-Micro Semantic Correspondence (MaMiCo) learning framework that pursues fine-grained spatiotemporal representations from a macro-to-micro perspective. Specifically, MaMiCo builds a multi-branch architecture of T-MaMiCo and S-MaMiCo on a temporally-nested clip pyramid (video-to-frame). On the pyramid, T-MaMiCo targets temporal correspondence by simultaneously assimilating semantically invariant representations and retaining appearance dynamics over long temporal ranges. For spatial correspondence, S-MaMiCo perceives subtle motion cues by improving dense CSL for videos, where stationary clips serve as a stable reference for dense contrasting to alleviate the semantic inconsistency caused by "mismatching". Extensive experiments show that MaMiCo learns rich, general video representations and performs well on various downstream tasks, e.g., (fine-grained) action recognition, action localization, and video retrieval.
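The abstract's "temporally-nested clip pyramid (video-to-frame)" can be pictured with a short sketch. The function below, including its name nested_clip_pyramid and the assumed level lengths (64, 16, 1), is only an illustrative sampling routine under those assumptions, not the authors' released implementation; it merely shows how each finer view can be drawn from inside the temporal span of the coarser one.

```python
# Illustrative sketch only: one plausible way to sample a temporally-nested
# clip pyramid (video -> clip -> frame) as macro-to-micro views.
# Level lengths and names here are assumptions, not the paper's code.
import random
from typing import List, Tuple


def nested_clip_pyramid(
    num_frames: int,
    clip_lengths: Tuple[int, ...] = (64, 16, 1),  # video-, clip-, frame-level (assumed)
) -> List[List[int]]:
    """Return frame indices per pyramid level; each finer level is sampled
    inside the temporal span of the coarser one ("temporally nested")."""
    assert num_frames >= clip_lengths[0], "video too short for the coarsest level"
    levels = []
    # Coarsest (macro) span: a random window of clip_lengths[0] frames.
    start = random.randint(0, num_frames - clip_lengths[0])
    span = (start, start + clip_lengths[0])
    for length in clip_lengths:
        lo, hi = span
        # Contiguous window of `length` frames inside the current span.
        s = random.randint(lo, hi - length)
        levels.append(list(range(s, s + length)))
        # Nest the next (finer) level inside this window.
        span = (s, s + length)
    return levels


if __name__ == "__main__":
    video_idx, clip_idx, frame_idx = nested_clip_pyramid(num_frames=300)
    print(len(video_idx), len(clip_idx), len(frame_idx))  # 64 16 1
    # Each finer view lies inside the coarser one, yielding macro-to-micro
    # pairs that a contrastive objective could align across granularities.
```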
Year
2022
DOI
10.1145/3503161.3547888
Venue
International Multimedia Conference
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
6
Name          Order  Citations  PageRank
Bo Fang       1      0          0.34
Wenhao Wu     2      10         4.87
Chang Liu     3      5711       17.41
Yu Zhou       4      982        2.73
Dongliang He  5      331        3.67
Weiping Wang  6      3356       3.84