Title
TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
Abstract
Due to their superior ability in global dependency modeling, Transformer and its variants have become the primary choice for many vision-and-language tasks. However, in tasks like Visual Question Answering (VQA) and Referring Expression Comprehension (REC), the multimodal prediction often requires visual information from macro- to micro-views. Therefore, how to dynamically schedule global and local dependency modeling in Transformer has become an emerging issue. In this paper, we propose an example-dependent routing scheme called TRAnsformer Routing (TRAR) to address this issue. Specifically, in TRAR, each visual Transformer layer is equipped with a routing module offering different attention spans. The model can dynamically select the corresponding attention span based on the output of the previous inference step, so as to formulate an optimal routing path for each example. Notably, with careful designs, TRAR reduces the additional computation and memory overhead to an almost negligible level. To validate TRAR, we conduct extensive experiments on five benchmark datasets of VQA and REC, and achieve notable performance gains over standard Transformers and a number of state-of-the-art methods.
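To make the routing idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of one routed self-attention layer: it computes self-attention under masks with different spans (one global plus two local windows) and mixes the results with example-dependent weights produced by a small router. The names `RoutedSelfAttention`, `span_mask`, the router design, and the specific span sizes are illustrative assumptions rather than the authors' implementation, and the sketch omits the efficiency tricks that make TRAR's overhead nearly negligible.

```python
# Hypothetical sketch of example-dependent attention-span routing (PyTorch).
# Module and parameter names are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def span_mask(n: int, span: int) -> torch.Tensor:
    """Boolean mask letting each position attend within a window of size
    `span`; span <= 0 means unrestricted (global) attention."""
    if span <= 0:
        return torch.ones(n, n, dtype=torch.bool)
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= span // 2


class RoutedSelfAttention(nn.Module):
    """Self-attention whose output is a soft mixture over several attention
    spans; the mixture weights are predicted per example by a router."""

    def __init__(self, dim: int, num_heads: int, spans=(0, 9, 5)):
        super().__init__()
        self.spans = spans
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.router = nn.Linear(dim, len(spans))  # one logit per span

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape
        # Example-dependent routing weights from pooled input features
        # (the paper routes on the output of the previous inference step).
        w = F.softmax(self.router(x.mean(dim=1)), dim=-1)   # (b, num_spans)
        outs = []
        for span in self.spans:
            blocked = ~span_mask(n, span).to(x.device)       # True = masked out
            out, _ = self.attn(x, x, x, attn_mask=blocked)
            outs.append(out)
        outs = torch.stack(outs, dim=1)                      # (b, S, n, dim)
        return (w[:, :, None, None] * outs).sum(dim=1)       # (b, n, dim)


# Usage: route 49 grid features of dimension 512 through one routed layer.
layer = RoutedSelfAttention(dim=512, num_heads=8)
feats = torch.randn(2, 49, 512)
print(layer(feats).shape)  # torch.Size([2, 49, 512])
```

Note that this naive version runs attention once per span; a more efficient design (as the abstract implies) would compute the attention matrix once and apply the span masks to it.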
Year: 2021
DOI: 10.1109/ICCV48922.2021.00208
Venue: ICCV
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 8
Name            Order  Citations  PageRank
Yiyi Zhou       1      7          3.46
Tianhe Ren      2      0          0.34
Chaoyang Zhu    3      0          0.68
Xiaoshuai Sun   4      623        58.76
Jianzhuang Liu  5      1614       98.72
Xinghao Ding    6      591        52.95
Mingliang Xu    7      372        54.07
Rongrong Ji     8      3616       189.98