Abstract
---
Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, among others. This has resulted in improved matching between the visual representation of an image and the textual representation of its caption. Yet, current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at europe.naverlabs.com/stacmr.
Year | DOI | Venue
---|---|---
2021 | 10.1109/WACV48630.2021.00227 | 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021)

DocType | ISSN | Citations
---|---|---
Conference | 2472-6737 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 5
Name | Order | Citations | PageRank
---|---|---|---
Andrés Mafla | 1 | 12 | 2.89 |
Rafael Sampaio de Rezende | 2 | 14 | 3.19 |
Lluís Gómez | 3 | 93 | 8.74 |
Diane Larlus | 4 | 2 | 1.39 |
Dimosthenis Karatzas | 5 | 406 | 38.13 |