Title |
---|
Multi-Modal Reasoning Graph For Scene-Text Based Fine-Grained Image Classification And Retrieval |
Abstract |
---|
Scene text instances found in natural images carry explicit semantic information that can provide important cues for a wide array of computer vision problems. In this paper, we focus on leveraging multi-modal content, in the form of visual and textual cues, to tackle fine-grained image classification and retrieval. First, we obtain the text instances from images with a text reading system. Then, we combine textual features with salient image regions to exploit the complementary information carried by the two sources. Specifically, we employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between the salient objects and the text found in an image. With this enhanced set of visual and textual features, the proposed model greatly outperforms the previous state of the art on two tasks, fine-grained classification and image retrieval, on the ConText [23] and Drink Bottle [4] datasets. |
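
The multi-modal reasoning step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature dimensions, the random weights, the dot-product affinity, and the helper names (`project`, `row_softmax`, `gcn_layer`) are all assumptions chosen for illustration. Visual region features and scene-text embeddings are projected into a shared space, treated as graph nodes, and passed through one graph-convolution step.

```python
import math
import random


def project(features, weight):
    """Multiply each feature row by `weight` (shape: in_dim x out_dim)."""
    return [[sum(x[i] * weight[i][j] for i in range(len(x)))
             for j in range(len(weight[0]))] for x in features]


def row_softmax(matrix):
    """Normalize each row into a probability distribution."""
    out = []
    for row in matrix:
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def gcn_layer(nodes, weight):
    """One graph-convolution step: ReLU(A @ H @ W).

    A is assumed here to be a softmax-normalized dot-product affinity
    between nodes; the paper may build its adjacency differently.
    """
    n, dim = len(nodes), len(nodes[0])
    affinity = [[sum(a * b for a, b in zip(nodes[i], nodes[j]))
                 for j in range(n)] for i in range(n)]
    adj = row_softmax(affinity)
    aggregated = [[sum(adj[i][k] * nodes[k][d] for k in range(n))
                   for d in range(dim)] for i in range(n)]
    return [[max(0.0, v) for v in row] for row in project(aggregated, weight)]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
visual = random_matrix(3, 8, rng)    # 3 salient regions, 8-d features (hypothetical sizes)
textual = random_matrix(2, 5, rng)   # 2 scene-text instances, 5-d embeddings (hypothetical sizes)
w_visual = random_matrix(8, 4, rng)  # projects visual features into the shared 4-d space
w_text = random_matrix(5, 4, rng)    # projects text features into the shared 4-d space
w_gcn = random_matrix(4, 4, rng)     # graph-convolution weight

# Visual and textual nodes live in one graph over the common space.
nodes = project(visual, w_visual) + project(textual, w_text)
enhanced = gcn_layer(nodes, w_gcn)   # one relationship-enhanced feature per node
```

In a trained model the weights would be learned and the enhanced node features pooled for classification or retrieval; here they are random and only show how the data flows.
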
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/WACV48630.2021.00407 | 2021 IEEE Winter Conference on Applications of Computer Vision (WACV 2021) |
DocType | ISSN | Citations |
Conference | 2472-6737 | 0 |
PageRank | References | Authors |
0.34 | 0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Andrés Mafla | 1 | 12 | 2.89 |
Sounak Dey | 2 | 12 | 7.03 |
Ali Furkan Biten | 3 | 0 | 1.35 |
Lluís Gómez | 4 | 0 | 0.34 |
Dimosthenis Karatzas | 5 | 406 | 38.13 |