Title
Counterfactual Contrastive Learning for Weakly-Supervised Vision-Language Grounding
Abstract
Weakly-supervised vision-language grounding (WSVLG) aims to localize a target moment in a video or a specific region in an image according to a given sentence query, where only video-level or image-level sentence annotations are available during training. Most existing approaches adopt a MIL-based or reconstruction-based paradigm for the WSVLG task; the former depends heavily on the quality of randomly selected negative samples, while the latter cannot directly optimize the visual-textual alignment score. In this paper, we propose a novel Counterfactual Contrastive Learning (CCL) paradigm that develops sufficient contrastive training between counterfactual positive and negative results, produced by robust and destructive counterfactual transformations, respectively. Concretely, we design three counterfactual transformation strategies at the feature, interaction and relation levels: the feature-level method damages the visual features of selected proposals, the interaction-level approach confuses the vision-language interaction, and the relation-level strategy destroys the context clues in proposal relationships. Extensive experiments on five vision-language grounding datasets verify the effectiveness of our CCL paradigm.
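To make the contrastive objective concrete, below is a minimal sketch of the feature-level strategy in PyTorch. Everything here is an illustrative assumption rather than the authors' released implementation: the function names feature_level_counterfactual and ccl_loss, the 0.3 damage ratio, the 0.5 margin, and the simple dot-product alignment scorer are all hypothetical stand-ins. The sketch demonstrates the idea stated above: erasing the proposals the query relies on produces a counterfactual negative, masking unimportant proposals produces a counterfactual positive, and a margin loss pushes the alignment score of the positive view above that of the destroyed view.

import torch
import torch.nn.functional as F

def feature_level_counterfactual(proposal_feats, importance, damage_ratio=0.3):
    # proposal_feats: (num_proposals, dim) visual features of candidate
    #                 moments/regions; importance: (num_proposals,) relevance
    #                 of each proposal to the sentence query (assumed given).
    k = max(1, int(damage_ratio * proposal_feats.size(0)))
    top = importance.topk(k).indices        # proposals the grounding relies on
    bottom = (-importance).topk(k).indices  # proposals it can safely ignore

    negative = proposal_feats.clone()
    negative[top] = 0.0     # destructive transformation: erase critical evidence

    positive = proposal_feats.clone()
    positive[bottom] = 0.0  # robust transformation: grounding should survive this

    return positive, negative

def ccl_loss(pos_score, neg_score, margin=0.5):
    # Contrast the counterfactual results: the alignment score of the robust
    # view should exceed that of the destroyed view by at least the margin.
    return F.relu(margin - pos_score + neg_score).mean()

# Toy usage with random tensors and a stand-in dot-product alignment scorer.
feats = torch.randn(16, 256)   # 16 proposals, 256-d features
importance = torch.rand(16)    # per-proposal relevance scores
query = torch.randn(256)       # pooled sentence embedding
pos, neg = feature_level_counterfactual(feats, importance)
loss = ccl_loss((pos @ query).mean(), (neg @ query).mean())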
Year
2020
Venue
NIPS 2020
DocType
Conference
Volume
33
Citations
0
PageRank
0.34
References
0
Authors
5
Name          Order
Zhu Zhang     1
Zhou Zhao     2
Zhijie Lin    3
Jieming Zhu   4
Xiuqiang He   5