Title: Video Moment Retrieval with Hierarchical Contrastive Learning
Abstract: This paper explores the task of video moment retrieval (VMR), which aims to localize the temporal boundary of a specific moment in an untrimmed video given a sentence query. Previous methods either extract features for pre-defined candidate moments and rank them to select the moment that best matches the query, or directly align the boundary clips of the target moment with the query and predict matching scores. Despite their effectiveness, these methods mostly align the query with video features at a single level (clip or moment) and ignore the different granularities present in the video itself, such as clip, moment, and video, resulting in insufficient cross-modal interaction. To this end, we propose a Temporal Localization Network with Hierarchical Contrastive Learning (HCLNet) for the VMR task. Specifically, we introduce a hierarchical contrastive learning method that better aligns the query and video by maximizing the mutual information (MI) between the query and three different granularities of the video to learn informative representations. Meanwhile, we introduce a self-supervised cycle-consistency loss to enforce further semantic alignment between fine-grained video clips and query words. Experiments on three standard benchmarks show the effectiveness of our proposed method.
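The abstract does not give the exact formulation of the hierarchical contrastive objective, so the following is a minimal PyTorch sketch of one plausible reading: an InfoNCE-style lower bound on mutual information computed between the query embedding and video features at the three granularities (clip, moment, video), then summed. The function names, mean-pooling of clips, temperature value, and equal weighting of the three terms are all illustrative assumptions, not HCLNet's published design.

```python
# Sketch of a hierarchical InfoNCE contrastive loss over three video granularities.
# Assumptions (not from the paper): in-batch negatives, mean-pooled clip features,
# temperature 0.07, and equal weighting of the clip/moment/video terms.
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, keys: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss; the positive key for each query is the one at the same
    batch index, and all other keys in the batch serve as negatives.

    query: (B, D) pooled sentence-query embeddings
    keys:  (B, D) video-side embeddings at one granularity
    """
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def hierarchical_contrastive_loss(q: torch.Tensor,
                                  clip_feat: torch.Tensor,
                                  moment_feat: torch.Tensor,
                                  video_feat: torch.Tensor) -> torch.Tensor:
    """Sum of InfoNCE terms between the query and each video granularity.

    q:           (B, D) pooled query embedding
    clip_feat:   (B, T, D) per-clip features, mean-pooled here for simplicity
    moment_feat: (B, D) target-moment features
    video_feat:  (B, D) global video features
    """
    clip_pooled = clip_feat.mean(dim=1)  # assumption: average clips into one vector
    return (info_nce(q, clip_pooled)
            + info_nce(q, moment_feat)
            + info_nce(q, video_feat))
```

Maximizing each InfoNCE term tightens a lower bound on the MI between the query and that granularity, which matches the abstract's stated goal of aligning the query with clip-, moment-, and video-level representations simultaneously.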
Year: 2022
DOI: 10.1145/3503161.3547963
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name            Order   Citations   PageRank
Bolin Zhang     1       0           0.34
Chao Yang       2       87          22.49
Bin Jiang       3       52          18.13
Xiaokang Zhou   4       0           0.34