Title
Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding
Abstract
Weakly-supervised vision-language grounding (WSVLG) aims to localize a target moment in a video or a specific region in an image according to a given sentence query, where only video-level or image-level sentence annotations are available during training. Most existing approaches adopt a MIL-based or reconstruction-based paradigm for the WSVLG task; the former depends heavily on the quality of randomly selected negative samples, while the latter cannot directly optimize the visual-textual alignment score. In this paper, we propose a novel Counterfactual Contrastive Learning (CCL) paradigm that develops sufficient contrastive training between counterfactual positive and negative results, produced by robust and destructive counterfactual transformations, respectively. Concretely, we design three counterfactual transformation strategies at the feature, interaction and relation levels: the feature-level method damages the visual features of selected proposals, the interaction-level approach confuses the vision-language interaction, and the relation-level strategy destroys the context clues in proposal relationships. Extensive experiments on five vision-language grounding datasets verify the effectiveness of our CCL paradigm.
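To make the contrastive objective concrete, below is a minimal sketch of the feature-level strategy in PyTorch. Everything here is an illustrative assumption rather than the authors' released implementation: the function names feature_level_counterfactual and ccl_loss, the 0.3 damage ratio, the 0.5 margin, and the simple dot-product alignment scorer are all hypothetical stand-ins. The sketch demonstrates the idea stated above: erasing the proposals the query relies on produces a counterfactual negative, masking unimportant proposals produces a counterfactual positive, and a margin loss pushes the alignment score of the positive view above that of the destroyed view.

import torch
import torch.nn.functional as F

def feature_level_counterfactual(proposal_feats, importance, damage_ratio=0.3):
    # proposal_feats: (num_proposals, dim) visual features of candidate
    #                 moments/regions; importance: (num_proposals,) relevance
    #                 of each proposal to the sentence query (assumed given).
    k = max(1, int(damage_ratio * proposal_feats.size(0)))
    top = importance.topk(k).indices        # proposals the grounding relies on
    bottom = (-importance).topk(k).indices  # proposals it can safely ignore

    negative = proposal_feats.clone()
    negative[top] = 0.0     # destructive transformation: erase critical evidence

    positive = proposal_feats.clone()
    positive[bottom] = 0.0  # robust transformation: grounding should survive this

    return positive, negative

def ccl_loss(pos_score, neg_score, margin=0.5):
    # Contrast the counterfactual results: the alignment score of the robust
    # view should exceed that of the destroyed view by at least the margin.
    return F.relu(margin - pos_score + neg_score).mean()

# Toy usage with random tensors and a stand-in dot-product alignment scorer.
feats = torch.randn(16, 256)   # 16 proposals, 256-d features
importance = torch.rand(16)    # per-proposal relevance scores
query = torch.randn(256)       # pooled sentence embedding
pos, neg = feature_level_counterfactual(feats, importance)
loss = ccl_loss((pos @ query).mean(), (neg @ query).mean())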
Year
2020
Venue
NIPS 2020
DocType
Conference
Volume
33
Citations
0
PageRank
0.34
References
0
Authors
5
Name          Order
Zhu Zhang     1
Zhou Zhao     2
Zhijie Lin    3
Jieming Zhu   4
Xiuqiang He   5