Title
Dynamic Grained Encoder for Vision Transformers
Abstract
Transformers, the de-facto standard for language modeling, have recently been applied to vision tasks. This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images and save computational costs. Specifically, we propose a Dynamic Grained Encoder for vision transformers, which adaptively assigns a suitable number of queries to each spatial region. It thus achieves fine-grained representation in discriminative regions while keeping high efficiency. Moreover, the dynamic grained encoder is compatible with most vision transformer frameworks. Without bells and whistles, our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification. Extensive experiments on object detection and segmentation further demonstrate the generalizability of our approach. Code is available at https://github.com/StevenGrove/vtpack.
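To make the idea in the abstract concrete, below is a minimal PyTorch sketch of an encoder layer with region-adaptive query granularity: each spatial window is pooled to a coarser or finer query grid chosen by a small gate, self-attention runs over the reduced query set, and the result is broadcast back to full resolution. This is an illustrative assumption of how such a layer could look, not the released vtpack implementation; the window size, the candidate grain set, the hard-argmax gate, and all names such as ToyDynamicGrainedLayer are hypothetical.

```python
# Illustrative sketch only (not the authors' implementation): a toy encoder layer
# that assigns fewer or more queries per spatial window via a learned gate.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyDynamicGrainedLayer(nn.Module):
    def __init__(self, dim, num_heads=4, window=4, grains=(1, 2, 4)):
        super().__init__()
        self.window = window          # spatial window size (window x window patches)
        self.grains = grains          # candidate query grids per window (g x g queries)
        self.gate = nn.Linear(dim, len(grains))   # scores one granularity per window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, H, W, C) patch embeddings; H and W are assumed divisible by window.
        B, H, W, C = x.shape
        w = self.window
        # Split into non-overlapping windows: (B * nWin, w, w, C).
        wins = (x.view(B, H // w, w, W // w, w, C)
                 .permute(0, 1, 3, 2, 4, 5)
                 .reshape(-1, w, w, C))
        # Pick one granularity per window from its mean feature (hard argmax here,
        # which is not differentiable; a real router needs a differentiable gate).
        idx = self.gate(wins.mean(dim=(1, 2))).argmax(dim=-1)          # (B * nWin,)
        out = torch.empty_like(wins)
        for gi, g in enumerate(self.grains):
            sel = idx == gi
            if not sel.any():
                continue
            region = wins[sel].permute(0, 3, 1, 2)                     # (n, C, w, w)
            q = F.adaptive_avg_pool2d(region, g)                       # coarse query grid
            q = q.flatten(2).transpose(1, 2)                           # (n, g*g, C)
            q, _ = self.attn(q, q, q)                                  # attend over few queries
            q = q.transpose(1, 2).reshape(-1, C, g, g)
            up = F.interpolate(q, size=(w, w), mode="nearest")         # broadcast back
            out[sel] = up.permute(0, 2, 3, 1)
        out = self.norm(out + wins)                                    # residual + norm
        # Merge windows back to (B, H, W, C).
        return (out.view(B, H // w, W // w, w, w, C)
                   .permute(0, 1, 3, 2, 4, 5)
                   .reshape(B, H, W, C))


# Example: windows routed to a coarse grain spend far fewer attention queries.
layer = ToyDynamicGrainedLayer(dim=64)
feats = torch.randn(2, 16, 16, 64)
print(layer(feats).shape)  # torch.Size([2, 16, 16, 64])
```

The hard argmax above is only for readability; training such a router end-to-end requires a differentiable gating scheme, and the savings reported in the abstract come from most windows being routed to coarse grains at inference time.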
Year
2021
Venue
Annual Conference on Neural Information Processing Systems
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
8
Name            Order  Citations  PageRank
Lin Song        1      10         3.21
Songyang Zhang  2      11         6.93
Songtao Liu     3      0          0.68
Zeming Li       4      0          0.34
Xuming He       5      0          0.34
Hongbin Sun     6      365        42.04
Jian Sun        7      25842      956.90
Nanning Zheng   8      0          0.34