Title
Show Your Faith: Cross-Modal Confidence-Aware Network for Image-Text Matching.
Abstract
Image-text matching bridges vision and language and is a crucial task in multi-modal intelligence. The key challenge lies in accurately measuring image-text relevance as matching evidence. Most existing works aggregate the local semantic similarities of matched region-word pairs into the overall relevance, typically assuming that all matched pairs are equally reliable. However, even when a region-word pair is locally matched across modalities, it may be inconsistent and thus unreliable from the global perspective of the whole image and text, leading to inaccurate relevance measurement. In this paper, we propose a novel Cross-Modal Confidence-Aware Network that infers a matching confidence indicating the reliability of each matched region-word pair, and combines it with the local semantic similarities to refine the relevance measurement. Specifically, we first estimate the matching confidence from the relevance between the semantics of an image region and the complete semantics of the image described by the text, with the text serving as a bridge. Further, to express region semantics more richly, we extend each region to its visual context in the image. Local semantic similarities are then weighted by the inferred confidence so that unreliable matched pairs are filtered out during aggregation. Comprehensive experiments show that our method achieves state-of-the-art performance on the Flickr30K and MSCOCO benchmarks.
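To make the confidence-weighted aggregation described in the abstract concrete, below is a minimal NumPy sketch. It is not the authors' implementation: the cosine-similarity measure, the max-over-regions word alignment, the softmax-normalized confidence, and all function names and tensor shapes are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + 1e-8)
    return a @ b.T

def confidence_weighted_relevance(regions, words, text_global):
    """Aggregate local region-word similarities into one relevance score,
    weighting each pair by a per-region confidence that measures how
    consistent the region is with the complete text semantics (the text
    acting as a bridge to the full image description).
    Shapes: regions (n_r, d), words (n_w, d), text_global (d,).
    """
    # Local semantic similarities between every region and every word.
    local_sim = cosine_sim(regions, words)                 # (n_r, n_w)
    # For each word, keep its best-matching region (local alignment).
    best_region_sim = local_sim.max(axis=0)                # (n_w,)
    best_region = local_sim.argmax(axis=0)                 # (n_w,)
    # Confidence: relevance of each region to the global text semantics,
    # softmax-normalized so unreliable regions get small weights.
    conf = cosine_sim(regions, text_global[None, :]).squeeze(1)  # (n_r,)
    conf = np.exp(conf) / np.exp(conf).sum()
    # Down-weight pairs whose aligned region has low confidence, then
    # average into a single image-text relevance score.
    weighted = best_region_sim * conf[best_region]
    return weighted.mean()

# Toy usage with random features.
rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 256))   # e.g. 36 detected region features
words = rng.normal(size=(12, 256))     # e.g. 12 word features
text_global = rng.normal(size=256)     # pooled sentence feature
print(confidence_weighted_relevance(regions, words, text_global))
```

In this sketch the confidence acts as a soft filter: a region-word pair with high local similarity still contributes little to the overall relevance if its region is weakly related to the full text semantics.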
Year
2022
Venue
AAAI Conference on Artificial Intelligence
Keywords
Computer Vision (CV)
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Huatian Zhang   1      0          0.34
Zhendong Mao    2      307        25.18
Enhong Chen     3      1235       86.93
Yongdong Zhang  4      2544       166.91