ViT-YOLO:Transformer-Based YOLO for Object Detection - Citegraph

Paper Info

Title
ViT-YOLO:Transformer-Based YOLO for Object Detection

Abstract
Drone-captured images have overwhelming characteristics including dramatic scale variance, complicated backgrounds filled with distractors, and flexible viewpoints, which pose enormous challenges for general object detectors based on common convolutional networks. Recently, the design of vision backbone architectures that use self-attention has become an exciting topic. In this work, an improved backbone, MHSA-Darknet, is designed to retain sufficient global context information and extract more differentiated features for object detection via multi-head self-attention. Regarding the path-aggregation neck, we present a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN) for effective cross-scale feature fusion. In addition, other techniques including test-time augmentation (TTA) and weighted boxes fusion (WBF) help to achieve better accuracy and robustness. Our experiments demonstrate that ViT-YOLO significantly outperforms the state-of-the-art detectors and achieves one of the top results in the VisDrone-DET 2021 challenge (39.41 mAP for the test-challenge data set and 41 mAP for the test-dev data set).
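The abstract names weighted boxes fusion (WBF) as a post-processing step but gives no detail. The sketch below illustrates the general idea as it is commonly described: overlapping detections pooled from several models or TTA passes are merged into one box whose coordinates are a confidence-weighted average. It is a simplified illustration, not the authors' implementation. The function names, the `iou_thr=0.55` default, and the input layout are assumptions made here, and class labels and per-model rescaling of the fused score are omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fuse_cluster(cluster):
    """Confidence-weighted average of coordinates; mean confidence as score.

    Assumes every score in the cluster is strictly positive.
    """
    total = sum(score for _, score in cluster)
    box = tuple(
        sum(b[i] * score for b, score in cluster) / total for i in range(4)
    )
    return box, total / len(cluster)


def weighted_boxes_fusion(detections, iou_thr=0.55):
    """Fuse (box, score) pairs pooled from all models or TTA passes.

    iou_thr is an illustrative value, not one taken from the paper.
    """
    clusters = []
    # Visit detections from most to least confident.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        for cluster in clusters:
            fused_box, _ = fuse_cluster(cluster)
            # Join the first cluster whose current fused box overlaps enough.
            if iou(fused_box, box) > iou_thr:
                cluster.append((box, score))
                break
        else:
            # No cluster matched: start a new one.
            clusters.append([(box, score)])
    return [fuse_cluster(c) for c in clusters]


detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 11, 11), 0.6),
    ((50, 50, 60, 60), 0.8),
]
fused = weighted_boxes_fusion(detections)
```

Unlike non-maximum suppression, which keeps the highest-scoring box and discards the rest, this scheme lets every overlapping prediction contribute to the final coordinates, which is why it pairs naturally with TTA.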

Year	DOI	Venue
2021	10.1109/ICCVW54120.2021.00314	2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW 2021)
DocType	Volume	Issue
Conference	2021	1
ISSN	Citations	PageRank
2473-9936	0	0.34
References	Authors
4	6

Authors (6 rows)

Name	Order	Citations	PageRank
Zixiao Zhang	1	0	0.34
Xiaoqiang Lu	2	0	0.34
Guojin Cao	3	0	1.35
Yuting Yang	4	1	1.37
Licheng Jiao	5	5698	475.84
Fang Liu	6	25	6.03
