Title
MaMiCo: Macro-to-Micro Semantic Correspondence for Self-supervised Video Representation Learning
Abstract
Contrastive self-supervised learning (CSL) has substantially advanced visual representation learning. However, existing video CSL methods mainly focus on clip-level temporal semantic consistency, and the temporal and spatial semantic correspondence across different granularities, i.e., the video, clip, and frame levels, is typically overlooked. To tackle this issue, we propose a self-supervised Macro-to-Micro Semantic Correspondence (MaMiCo) learning framework that pursues fine-grained spatiotemporal representations from a macro-to-micro perspective. Specifically, MaMiCo builds a multi-branch architecture of T-MaMiCo and S-MaMiCo on a temporally-nested clip pyramid (video-to-frame). On the pyramid, T-MaMiCo targets temporal correspondence by simultaneously assimilating semantically invariant representations and retaining appearance dynamics over long temporal ranges. For spatial correspondence, S-MaMiCo perceives subtle motion cues by improving dense CSL for videos, where stationary clips serve as a stable reference for dense contrasting to alleviate the semantic inconsistency caused by "mismatching". Extensive experiments show that MaMiCo learns rich, general video representations and performs well on various downstream tasks, e.g., (fine-grained) action recognition, action localization, and video retrieval.
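The abstract's "temporally-nested clip pyramid (video-to-frame)" can be pictured with a short sketch. The function below, including its name nested_clip_pyramid and the assumed level lengths (64, 16, 1), is only an illustrative sampling routine under those assumptions, not the authors' released implementation; it merely shows how each finer view can be drawn from inside the temporal span of the coarser one.

```python
# Illustrative sketch only: one plausible way to sample a temporally-nested
# clip pyramid (video -> clip -> frame) as macro-to-micro views.
# Level lengths and names here are assumptions, not the paper's code.
import random
from typing import List, Tuple


def nested_clip_pyramid(
    num_frames: int,
    clip_lengths: Tuple[int, ...] = (64, 16, 1),  # video-, clip-, frame-level (assumed)
) -> List[List[int]]:
    """Return frame indices per pyramid level; each finer level is sampled
    inside the temporal span of the coarser one ("temporally nested")."""
    assert num_frames >= clip_lengths[0], "video too short for the coarsest level"
    levels = []
    # Coarsest (macro) span: a random window of clip_lengths[0] frames.
    start = random.randint(0, num_frames - clip_lengths[0])
    span = (start, start + clip_lengths[0])
    for length in clip_lengths:
        lo, hi = span
        # Contiguous window of `length` frames inside the current span.
        s = random.randint(lo, hi - length)
        levels.append(list(range(s, s + length)))
        # Nest the next (finer) level inside this window.
        span = (s, s + length)
    return levels


if __name__ == "__main__":
    video_idx, clip_idx, frame_idx = nested_clip_pyramid(num_frames=300)
    print(len(video_idx), len(clip_idx), len(frame_idx))  # 64 16 1
    # Each finer view lies inside the coarser one, yielding macro-to-micro
    # pairs that a contrastive objective could align across granularities.
```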
Year
2022
DOI
10.1145/3503161.3547888
Venue
International Multimedia Conference
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
6
Name          Order  Citations  PageRank
Bo Fang       1      0          0.34
Wenhao Wu     2      10         4.87
Chang Liu     3      5711       17.41
Yu Zhou       4      982        2.73
Dongliang He  5      331        3.67
Weiping Wang  6      3356       3.84