Title
Photo Stream Question Answer
Abstract
Understanding and reasoning over partially observed visual clues is often regarded as a challenging real-world problem, even for human beings. In this paper, we present a new visual question answering (VQA) task -- Photo Stream QA, which aims to answer open-ended questions about a narrative photo stream. Photo Stream QA is more challenging and interesting than existing VQA tasks, since the temporal and visual variation among photos in the stream is large and hard to observe. Therefore, instead of learning simple vision-text mappings, AI algorithms must fill these gaps through recollection, reasoning, and even knowledge drawn from daily experience. To tackle the problems in Photo Stream QA, we propose an end-to-end baseline (E-TAA) with a novel Experienced Unit (E-unit) and Three-stage Alternating Attention (TAA). The E-unit yields an improved visual representation that captures the temporal-semantic relations among visual clues in the photo stream, while TAA applies three levels of attention that gradually refine the visual features, using the textual representation of the question as guidance. Experimental results on our newly developed dataset demonstrate that, as a first attempt at the Photo Stream QA task, E-TAA provides promising results, outperforming all other baseline methods.
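
The following is a minimal sketch, not the authors' E-TAA code, of the two ideas the abstract describes: a unit that mixes temporal context into per-photo features (here approximated by a bidirectional GRU) and a question-guided attention loop that re-weights the photo features over three stages. All module names, dimensions, and the fusion scheme are illustrative assumptions.

# Sketch only: assumes precomputed photo features and a pooled question
# embedding; the real E-unit and TAA modules may differ substantially.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalUnit(nn.Module):
    """Adds temporal context to photo features (rough stand-in for the E-unit)."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, photos):          # photos: (batch, n_photos, dim)
        context, _ = self.rnn(photos)   # temporal context, same shape as input
        return photos + context         # residual mix of appearance and temporal cues


class AlternatingAttention(nn.Module):
    """Three rounds of question-guided attention over the photo stream."""
    def __init__(self, dim, n_stages=3):
        super().__init__()
        self.n_stages = n_stages
        self.score = nn.ModuleList(nn.Linear(dim, 1) for _ in range(n_stages))
        self.update = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(n_stages))

    def forward(self, photos, question):                            # question: (batch, dim)
        query = question
        for stage in range(self.n_stages):
            # Score each photo against the current query, then pool.
            joint = torch.tanh(photos * query.unsqueeze(1))         # (B, N, D)
            weights = F.softmax(self.score[stage](joint), dim=1)    # (B, N, 1)
            attended = (weights * photos).sum(dim=1)                # (B, D)
            # Alternate: refine the query with what was just attended to.
            query = torch.tanh(self.update[stage](torch.cat([query, attended], dim=-1)))
        return query                                                # fused representation


if __name__ == "__main__":
    B, N, D, n_answers = 2, 8, 256, 1000
    photos = torch.randn(B, N, D)       # e.g. CNN features of 8 photos in the stream
    question = torch.randn(B, D)        # e.g. pooled RNN encoding of the question
    fused = AlternatingAttention(D)(TemporalUnit(D)(photos), question)
    logits = nn.Linear(D, n_answers)(fused)    # open-ended answer classification head
    print(logits.shape)                        # torch.Size([2, 1000])

The alternating update (query refined by the attended visual summary, then reused to re-score the photos) is one plausible reading of "three levels of attention guided by the question"; the paper itself should be consulted for the actual formulation.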
Year
2020
DOI
10.1145/3394171.3413745
Venue
MM '20: The 28th ACM International Conference on Multimedia, Seattle, WA, USA, October 2020
DocType
Conference
ISBN
978-1-4503-7988-5
Citations
0
PageRank
0.34
References
27
Authors
7
Name              Order  Citations  PageRank
Wenqiao Zhang     1      3          2.73
Siliang Tang      2      179        33.98
Yanpeng Cao       3      30         6.32
Jun Xiao          4      513        50.95
Shiliang Pu       5      187        42.65
Fei Wu            6      2209       153.88
Yue-Ting Zhuang   7      3549       216.06