Abstract |
---|
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and more complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance. |
Year | DOI | Venue |
---|---|---|
2015 | 10.1007/s11263-016-0966-6 | International Journal of Computer Vision |
Keywords | Field | DocType |
---|---|---|
Visual Question Answering | Question answering, Information retrieval, Computer science, Closed set, Natural language, Artificial intelligence, Natural language processing, Mirroring | Journal |
Volume | Issue | ISSN |
---|---|---|
123 | 1 | 0920-5691 |
Citations | PageRank | References |
---|---|---|
320 | 7.87 | 44 |
Authors |
---|
7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Stanislaw Antol | 1 | 356 | 10.61 |
Aishwarya Agrawal | 2 | 360 | 10.62 |
Jiasen Lu | 3 | 544 | 16.43 |
Margaret Mitchell | 4 | 1450 | 65.37 |
Dhruv Batra | 5 | 2142 | 104.81 |
C. Lawrence Zitnick | 6 | 7321 | 332.72 |
Devi Parikh | 7 | 2929 | 132.01 |