Self-Adaptive Neural Module Transformer for Visual Question Answering - Citegraph

Paper Info

Title
Self-Adaptive Neural Module Transformer for Visual Question Answering

Abstract
Vision and language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Visual Question Answering (VQA) is even more challenging, since it requires complex reasoning steps to reach the correct answer. To achieve this, Neural Module Network (NMN) and its variants rely on parsing the natural language question into a module layout (i.e., a problem-solving program). In particular, this process follows a feedforward encoder-decoder pipeline: the encoder embeds the question into a static vector and the decoder generates the layout. However, we argue that such a conventional encoder-decoder neglects the dynamic nature of question comprehension (i.e., we should attend to different words from step to step) and per-module intermediate results (i.e., we should discard modules that perform badly) in the reasoning steps. In this paper, we present a novel NMN, called Self-Adaptive Neural Module Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results. Specifically, we encode the intermediate results with the given question features by a novel transformer module to generate a dynamic question feature embedding which evolves over reasoning steps. Besides, the transformer utilizes the intermediate results from each reasoning step to guide subsequent layout arrangement. Extensive experimental evaluations demonstrate the superiority of the proposed SANMT over NMN and its variants on four challenging benchmarks, including CLEVR, CLEVR-CoGenT, VQAv1.0, and VQAv2.0 (on average, the relative improvements over NMN are 1.5, 2.3, 0.7 and 0.5 points with respect to accuracy).

Year	DOI	Venue
2021	10.1109/TMM.2020.2995278	IEEE Transactions on Multimedia
Keywords	DocType	Volume
Visual question answering,neural module transformer,multi modal,self-adaptive	Journal	23
ISSN	Citations	PageRank
1520-9210	3	0.37
References	Authors
0	6

Authors (6 rows)


Name	Order	Citations	PageRank
Zhong Huasong	1	3	0.37
Jingyuan Chen	2	228	7.50
Chen Shen	3	103	17.21
Hanwang Zhang	4	1965	78.34
Jianqiang Huang	5	55	19.18
Xian-Sheng Hua	6	6566	328.17
