Abstract
---
Understanding and reasoning over partially observed visual clues is often regarded as a challenging real-world problem, even for human beings. In this paper, we present a new visual question answering (VQA) task, Photo Stream QA, which aims to answer open-ended questions about a narrative photo stream. Photo Stream QA is more challenging and interesting than existing VQA tasks, since the temporal and visual variance among photos in the stream is large and hard to observe. Therefore, instead of learning simple vision-text mappings, AI algorithms must fill these variance gaps with recollection, reasoning, and even knowledge from daily experience. To tackle the problems in Photo Stream QA, we propose an end-to-end baseline (E-TAA) with a novel Experienced Unit (E-unit) and Three-stage Alternating Attention (TAA). The E-unit yields a better visual representation that captures the temporal semantic relations among visual clues in the photo stream, while TAA creates three levels of attention that gradually refine the visual features, using the textual representation of the question as guidance. Experimental results on our developed dataset demonstrate that, as the first attempt at the Photo Stream QA task, E-TAA provides promising results, outperforming all the other baseline methods.
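The abstract describes TAA as repeated question-guided attention over photo-stream features. The paper's actual architecture is not given in this record, so the sketch below is only a minimal illustration of the general idea, with hypothetical shapes and a simple additive guidance update: at each stage, a guidance vector scores the photo features, a softmax pools an attended visual summary, and the summary refines the guidance for the next stage.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def alternating_attention(V, q, stages=3):
    """Illustrative question-guided attention (NOT the paper's exact TAA).

    V: (n_frames, d) visual features for the photos in the stream.
    q: (d,) question embedding used as the initial guidance.
    Each stage attends over the photos with the current guidance,
    pools an attended summary, and updates the guidance with it.
    """
    d = V.shape[1]
    guide = q
    for _ in range(stages):
        scores = V @ guide / np.sqrt(d)  # relevance of each photo to the guidance
        alpha = softmax(scores)          # attention distribution over photos
        visual = alpha @ V               # attended visual summary, shape (d,)
        guide = guide + visual           # refine guidance for the next stage
    return visual, alpha

rng = np.random.default_rng(0)
V = rng.standard_normal((5, 8))  # hypothetical: 5 photos, 8-dim features
q = rng.standard_normal(8)
summary, weights = alternating_attention(V, q)
```

The three stages mirror the "gradually refines visual features" wording: each pass sharpens the attention distribution around photos consistent with the (updated) question guidance.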
Year | DOI | Venue
---|---|---
2020 | 10.1145/3394171.3413745 | MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020

DocType | ISBN | Citations
---|---|---
Conference | 978-1-4503-7988-5 | 0

PageRank | References | Authors
---|---|---
0.34 | 27 | 7
Name | Order | Citations | PageRank
---|---|---|---
Wenqiao Zhang | 1 | 3 | 2.73 |
Siliang Tang | 2 | 179 | 33.98 |
Yanpeng Cao | 3 | 30 | 6.32 |
Jun Xiao | 4 | 513 | 50.95 |
Shiliang Pu | 5 | 187 | 42.65 |
Fei Wu | 6 | 2209 | 153.88 |
Yue-Ting Zhuang | 7 | 3549 | 216.06 |