Title
TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
Abstract
Due to their superior ability in global dependency modeling, Transformer and its variants have become the primary choice for many vision-and-language tasks. However, in tasks like Visual Question Answering (VQA) and Referring Expression Comprehension (REC), the multimodal prediction often requires visual information from macro- to micro-views. Therefore, how to dynamically schedule global and local dependency modeling in Transformer has become an emerging issue. In this paper, we propose an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address this issue. Specifically, in TRAR, each visual Transformer layer is equipped with a routing module offering different attention spans. The model can dynamically select the corresponding attention span based on the output of the previous inference step, so as to formulate an optimal routing path for each example. Notably, with careful designs, TRAR reduces the additional computation and memory overhead to an almost negligible level. To validate TRAR, we conduct extensive experiments on five benchmark datasets of VQA and REC, and achieve notable performance gains over standard Transformers and a number of state-of-the-art methods.
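To make the routing idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of one routed self-attention layer: it computes self-attention under masks with different spans (one global plus two local windows) and mixes the results with example-dependent weights produced by a small router. The names `RoutedSelfAttention`, `span_mask`, the router design, and the specific span sizes are illustrative assumptions rather than the authors' implementation, and the sketch omits the efficiency tricks that make TRAR's overhead nearly negligible.

```python
# Hypothetical sketch of example-dependent attention-span routing (PyTorch).
# Module and parameter names are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def span_mask(n: int, span: int) -> torch.Tensor:
    """Boolean mask letting each position attend within a window of size
    `span`; span <= 0 means unrestricted (global) attention."""
    if span <= 0:
        return torch.ones(n, n, dtype=torch.bool)
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= span // 2


class RoutedSelfAttention(nn.Module):
    """Self-attention whose output is a soft mixture over several attention
    spans; the mixture weights are predicted per example by a router."""

    def __init__(self, dim: int, num_heads: int, spans=(0, 9, 5)):
        super().__init__()
        self.spans = spans
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.router = nn.Linear(dim, len(spans))  # one logit per span

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        # Example-dependent routing weights from pooled input features
        # (the paper routes on the output of the previous inference step).
        w = F.softmax(self.router(x.mean(dim=1)), dim=-1)   # (b, num_spans)
        outs = []
        for span in self.spans:
            blocked = ~span_mask(n, span).to(x.device)       # True = masked out
            out, _ = self.attn(x, x, x, attn_mask=blocked)
            outs.append(out)
        outs = torch.stack(outs, dim=1)                      # (b, S, n, dim)
        return (w[:, :, None, None] * outs).sum(dim=1)       # (b, n, dim)


# Usage: route 49 grid features of dimension 512 through one routed layer.
layer = RoutedSelfAttention(dim=512, num_heads=8)
feats = torch.randn(2, 49, 512)
print(layer(feats).shape)  # torch.Size([2, 49, 512])
```

Note that this naive version runs attention once per span; a more efficient design (as the abstract implies) would compute the attention matrix once and apply the span masks to it.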
Year: 2021
DOI: 10.1109/ICCV48922.2021.00208
Venue: ICCV
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 8
Name            Order  Citations  PageRank
Yiyi Zhou       1      7          3.46
Tianhe Ren      2      0          0.34
Chaoyang Zhu    3      0          0.68
Xiaoshuai Sun   4      623        58.76
Jianzhuang Liu  5      1614       98.72
Xinghao Ding    6      591        52.95
Mingliang Xu    7      372        54.07
Rongrong Ji     8      3616       189.98