Abstract
In this work, we introduce VQA 360°, a novel task of visual question answering on 360° images. Unlike a conventional field-of-view image, a 360° image captures the entire visual content around the optical center of a camera, demanding more sophisticated spatial understanding and reasoning. To address this problem, we collect the first VQA 360° dataset, containing around 17,000 real-world image-question-answer triplets covering a variety of question types. We then study two different VQA models on VQA 360°: a conventional model that takes an equirectangular image (with intrinsic distortion) as input, and a dedicated model that first projects the 360° image onto cubemaps and then aggregates information from multiple spatial resolutions. We demonstrate that the cubemap-based model with multi-level fusion and attention diffusion performs favorably against other variants and the equirectangular-based models. Nevertheless, the gap between human and machine performance reveals the need for more advanced VQA 360° algorithms. We therefore expect our dataset and studies to serve as a benchmark for future development on this challenging task. The dataset, code, and pre-trained models are available online.
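The cubemap-based model described in the abstract starts from a standard spherical-to-cube projection. As a rough illustration only, the minimal NumPy sketch below shows how each cubemap face can be resampled from an equirectangular panorama; the axis conventions, nearest-neighbour sampling, and the function name `equirect_to_cube_face` are our own simplifications, not the authors' implementation.

```python
import numpy as np

def equirect_to_cube_face(equi, face, size):
    """Sample one cubemap face from an equirectangular panorama.

    equi: (H, W, 3) array with W == 2 * H (360° longitude, 180° latitude).
    face: one of 'front', 'right', 'back', 'left', 'up', 'down'.
    size: output face resolution (size x size).
    Nearest-neighbour sampling is used for brevity.
    """
    H, W = equi.shape[:2]
    # Regular pixel grid on the face plane, normalized to [-1, 1].
    u, v = np.meshgrid(np.linspace(-1, 1, size), np.linspace(-1, 1, size))
    one = np.ones_like(u)
    # A 3D viewing ray per face pixel (x forward, y right, z up);
    # these axis conventions are an assumption for illustration.
    rays = {
        'front': ( one,    u,  -v),
        'right': (  -u,  one,  -v),
        'back':  (-one,   -u,  -v),
        'left':  (   u, -one,  -v),
        'up':    (   v,    u, one),
        'down':  (  -v,    u, -one),
    }
    x, y, z = rays[face]
    # Ray direction -> spherical coordinates.
    lon = np.arctan2(y, x)               # longitude in [-pi, pi]
    lat = np.arctan2(z, np.hypot(x, y))  # latitude in [-pi/2, pi/2]
    # Spherical coordinates -> equirectangular pixel indices.
    col = np.round((lon / np.pi + 1.0) / 2.0 * (W - 1)).astype(int)
    row = np.round((0.5 - lat / np.pi) * (H - 1)).astype(int)
    return equi[row, col]

# Example: split a panorama into its six cubemap faces.
pano = np.zeros((512, 1024, 3), dtype=np.uint8)  # stand-in for a real 360° image
faces = {f: equirect_to_cube_face(pano, f, 256)
         for f in ('front', 'right', 'back', 'left', 'up', 'down')}
```

Each face is a distortion-free perspective view, which is why a cubemap-based model can apply ordinary convolutional features per face before fusing them across resolutions, as the abstract describes.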
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/WACV45572.2020.9093452 | WACV |
DocType | Citations | PageRank
---|---|---
Conference | 1 | 0.34

References | Authors
---|---
0 | 5
Name | Order | Citations | PageRank |
---|---|---|---
Shih-Han Chou | 1 | 1 | 0.68 |
Wei-Lun Chao | 2 | 391 | 19.32 |
Wei-Sheng Lai | 3 | 164 | 12.87 |
Min Sun | 4 | 1083 | 59.15 |
Ming-Hsuan Yang | 5 | 15303 | 620.69