Title
ViT-YOLO: Transformer-Based YOLO for Object Detection
Abstract
Drone-captured images have challenging characteristics, including dramatic scale variance, complicated backgrounds filled with distractors, and flexible viewpoints, which pose enormous challenges for general object detectors based on common convolutional networks. Recently, the design of vision backbone architectures that use self-attention has become an exciting topic. In this work, an improved backbone, MHSA-Darknet, is designed to retain sufficient global context information and extract more differentiated features for object detection via multi-head self-attention. For the path-aggregation neck, we present a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN) for effective cross-scale feature fusion. In addition, other techniques, including test-time augmentation (TTA) and weighted boxes fusion (WBF), help to achieve better accuracy and robustness. Our experiments demonstrate that ViT-YOLO significantly outperforms state-of-the-art detectors and achieves one of the top results in the VisDrone-DET 2021 challenge (39.41 mAP on the test-challenge data set and 41 mAP on the test-dev data set).
Year | DOI | Venue
2021 | 10.1109/ICCVW54120.2021.00314 | 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021)
DocType | Volume | Issue
Conference | 2021 | 1
ISSN | Citations | PageRank
2473-9936 | 0 | 0.34
References | Authors
4 | 6
Name | Order | Citations | PageRank
Zixiao Zhang | 1 | 0 | 0.34
Xiaoqiang Lu | 2 | 0 | 0.34
Guojin Cao | 3 | 0 | 1.35
Yuting Yang | 4 | 1 | 1.37
Licheng Jiao | 5 | 5698 | 475.84
Fang Liu | 6 | 25 | 6.03