Abstract
---
Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, among others. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at europe.naverlabs.com/stacmr.
Year | DOI | Venue
---|---|---
2021 | 10.1109/WACV48630.2021.00227 | 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021)

DocType | ISSN | Citations
---|---|---
Conference | 2472-6737 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 5
Name | Order | Citations | PageRank
---|---|---|---
Andrés Mafla | 1 | 12 | 2.89 |
Rafael Sampaio de Rezende | 2 | 14 | 3.19 |
Lluís Gómez | 3 | 93 | 8.74 |
Diane Larlus | 4 | 2 | 1.39 |
Dimosthenis Karatzas | 5 | 406 | 38.13 |