Title
Two-Step Joint Attention Network for Visual Question Answering
Abstract
Visual Question Answering (VQA) is the task of automatically answering natural language questions about the content of a reference image. A common approach extracts an image feature and a question feature with deep neural networks, and then combines the two features through an attention mechanism to predict the answer. Most attention methods for VQA only consider which local regions of the image are relevant to the answer, and ignore that question words contribute to the answer with different weights. Hence, we propose a two-step joint attention that uses a combined representation of the image feature and the question feature to guide both visual attention and question attention. Two-step joint attention gradually focuses on the given image and question, moving from coarse-grained parts to fine-grained parts, to predict the answer. To extract image features precisely, we also propose BiSRU and use an RNN based on BiSRU so that adjacent local region vectors of the image can share information with each other. We demonstrate and analyze the effectiveness of our approach on the VQA dataset, and use visualization to show the results intuitively.
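The abstract's two-step joint attention can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's exact formulation: the bilinear scoring weights `Wv`/`Wq` and the mean-pooled initial context are hypothetical choices made only to show the flow of "joint context guides attention over both image regions and question words, twice".

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def joint_attention_step(V, Q, ctx, Wv, Wq):
    """One step of joint attention guided by a shared context vector.

    V: (R, d) image region features; Q: (T, d) question word features.
    Returns attended image/question features and the attention weights.
    """
    a_v = softmax(V @ Wv @ ctx)   # weight per image region, sums to 1
    a_q = softmax(Q @ Wq @ ctx)   # weight per question word, sums to 1
    return a_v @ V, a_q @ Q, a_v, a_q

rng = np.random.default_rng(0)
d, R, T = 8, 6, 5
V = rng.standard_normal((R, d))        # image region features (hypothetical)
Q = rng.standard_normal((T, d))        # question word features (hypothetical)
Wv = rng.standard_normal((d, d)) * 0.1 # illustrative scoring weights
Wq = rng.standard_normal((d, d)) * 0.1

# Step 1: coarse attention, guided by a mean-pooled joint context.
ctx1 = np.tanh(V.mean(axis=0) + Q.mean(axis=0))
v1, q1, av1, aq1 = joint_attention_step(V, Q, ctx1, Wv, Wq)

# Step 2: refined attention, guided by the step-1 joint representation.
ctx2 = np.tanh(v1 + q1)
v2, q2, av2, aq2 = joint_attention_step(V, Q, ctx2, Wv, Wq)
```

The two calls mirror the coarse-to-fine progression the abstract describes: the second step's context is built from the first step's attended features, so both the visual and the question attention are refined jointly.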
Year
2017
DOI
10.1109/BIGCOM.2017.17
Venue
2017 3rd International Conference on Big Data Computing and Communications (BIGCOM)
Keywords
VQA,two-step joint attention,BiSRU
Field
Question answering,Joint attention,Interrogative word,Computer science,Visualization,Reference image,Visual attention,Natural language,Natural language processing,Artificial intelligence,Artificial neural network
DocType
Conference
ISBN
978-1-5386-3350-2
Citations
0
PageRank
0.34
References
9
Authors
5
Name            Order  Citations  PageRank
Weiming Zhang   1      83         15.80
Chunhong Zhang  2      14         6.37
Pei Liu         3      4          4.47
Zhiqiang Zhan   4      8          2.12
Xiaofeng Qiu    5      0          1.69