Abstract |
---|
This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text. Current state-of-the-art models for this task make use of a dual attention mechanism in which one attention module attends to visual features while the other attends to textual features. A possible issue with this approach is that it makes it difficult for the model to reason jointly about both modalities. To address this problem we propose a new model based on a single attention mechanism that attends to multi-modal features conditioned on the question. The output weights of this attention module over a grid of multi-modal spatial features are interpreted as the probability that a certain spatial location of the image contains the answer text to the given question. Our experiments demonstrate competitive performance on two standard datasets with a model that is 5× faster than previous methods at inference time. Furthermore, we also provide a novel analysis of the ST-VQA dataset based on a human performance study. Supplementary material, code, and data are made available through this link. |
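Reading the abstract as an architecture description, a minimal sketch of such a question-conditioned single attention over a grid of fused multi-modal features might look like the following. This is an illustrative assumption, not the authors' published model: the module names, layer dimensions, and the concatenation-based fusion are all hypothetical choices made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleAttentionOverGrid(nn.Module):
    """Hypothetical sketch: one attention module over a grid of fused
    visual + textual features, conditioned on the question embedding.
    Layer sizes and the concatenation fusion are assumptions."""

    def __init__(self, vis_dim=512, txt_dim=300, q_dim=256, hid_dim=256):
        super().__init__()
        self.fuse = nn.Linear(vis_dim + txt_dim, hid_dim)  # fuse modalities per grid cell
        self.q_proj = nn.Linear(q_dim, hid_dim)            # project question embedding
        self.score = nn.Linear(hid_dim, 1)                 # scalar attention score per cell

    def forward(self, vis_grid, txt_grid, q_emb):
        # vis_grid: (B, H*W, vis_dim)  visual features per spatial location
        # txt_grid: (B, H*W, txt_dim)  text (OCR token) features per location
        # q_emb:    (B, q_dim)         question embedding
        mm = torch.tanh(self.fuse(torch.cat([vis_grid, txt_grid], dim=-1)))
        q = torch.tanh(self.q_proj(q_emb)).unsqueeze(1)    # (B, 1, hid_dim)
        logits = self.score(mm * q).squeeze(-1)            # (B, H*W)
        # Softmax over the grid: probability that each spatial location
        # contains the answer text to the question.
        return F.softmax(logits, dim=-1)

# Usage with dummy shapes (a 7x7 grid flattened to 49 cells):
att = SingleAttentionOverGrid()
probs = att(torch.randn(2, 49, 512), torch.randn(2, 49, 300), torch.randn(2, 256))
```

Under this reading, inference presumably amounts to taking the highest-probability grid cell and emitting the scene-text token found there, and the single softmax over already-fused features is what lets one module weigh visual and textual evidence jointly rather than splitting attention across two modality-specific branches.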
Year | DOI | Venue |
---|---|---|
2021 | 10.1016/j.patrec.2021.06.026 | PATTERN RECOGNITION LETTERS |
Keywords | DocType | Volume
---|---|---|
Deep learning, Scene text, Visual question answering, Multi-modal learning | Journal | 150
ISSN | Citations | PageRank
---|---|---|
0167-8655 | 2 | 0.38
References | Authors
---|---|
0 | 7
Name | Order | Citations | PageRank |
---|---|---|---|
Lluís Gómez | 1 | 93 | 8.74 |
Ali Furkan Biten | 2 | 9 | 2.18 |
Rubèn Pérez Tito | 3 | 2 | 0.38 |
Andrés Mafla | 4 | 12 | 2.89 |
Marçal Rusiñol | 5 | 2 | 0.38 |
Ernest Valveny | 6 | 647 | 41.65 |
Dimosthenis Karatzas | 7 | 406 | 38.13 |