Title: Video Moment Retrieval with Hierarchical Contrastive Learning
Abstract: This paper explores the task of video moment retrieval (VMR), which aims to localize the temporal boundary of a specific moment in an untrimmed video given a sentence query. Previous methods either extract features for pre-defined candidate moments and rank them to select the moment that best matches the query, or directly align the boundary clips of the target moment with the query and predict matching scores. Despite their effectiveness, these methods mostly align the query with video features at a single level (clip or moment) and ignore the different granularities present in the video itself, such as clip, moment, and video, resulting in insufficient cross-modal interaction. To this end, we propose a Temporal Localization Network with Hierarchical Contrastive Learning (HCLNet) for the VMR task. Specifically, we introduce a hierarchical contrastive learning method that better aligns the query and video by maximizing the mutual information (MI) between the query and three different granularities of the video to learn informative representations. Meanwhile, we introduce a self-supervised cycle-consistency loss to enforce further semantic alignment between fine-grained video clips and query words. Experiments on three standard benchmarks show the effectiveness of our proposed method.
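The abstract does not give the exact formulation of the hierarchical contrastive objective, so the following is a minimal PyTorch sketch of one plausible reading: an InfoNCE-style lower bound on mutual information computed between the query embedding and video features at the three granularities (clip, moment, video), then summed. The function names, mean-pooling of clips, temperature value, and equal weighting of the three terms are all illustrative assumptions, not HCLNet's published design.

```python
# Sketch of a hierarchical InfoNCE contrastive loss over three video granularities.
# Assumptions (not from the paper): in-batch negatives, mean-pooled clip features,
# temperature 0.07, and equal weighting of the clip/moment/video terms.
import torch
import torch.nn.functional as F

def info_nce(query: torch.Tensor, keys: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE loss; the positive key for each query is the one at the same
    batch index, and all other keys in the batch serve as negatives.

    query: (B, D) pooled sentence-query embeddings
    keys:  (B, D) video-side embeddings at one granularity
    """
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)

def hierarchical_contrastive_loss(q: torch.Tensor,
                                  clip_feat: torch.Tensor,
                                  moment_feat: torch.Tensor,
                                  video_feat: torch.Tensor) -> torch.Tensor:
    """Sum of InfoNCE terms between the query and each video granularity.

    q:           (B, D) pooled query embedding
    clip_feat:   (B, T, D) per-clip features, mean-pooled here for simplicity
    moment_feat: (B, D) target-moment features
    video_feat:  (B, D) global video features
    """
    clip_pooled = clip_feat.mean(dim=1)  # assumption: average clips into one vector
    return (info_nce(q, clip_pooled)
            + info_nce(q, moment_feat)
            + info_nce(q, video_feat))
```

Maximizing each InfoNCE term tightens a lower bound on the MI between the query and that granularity, which matches the abstract's stated goal of aligning the query with clip-, moment-, and video-level representations simultaneously.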
Year: 2022
DOI: 10.1145/3503161.3547963
Venue: International Multimedia Conference
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name            Order   Citations   PageRank
Bolin Zhang     1       0           0.34
Chao Yang       2       87          22.49
Bin Jiang       3       52          18.13
Xiaokang Zhou   4       0           0.34