Abstract
---
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub at github.com/google-research/scenic/tree/main/scenic/projects/owl_vit.
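The abstract describes zero-shot, text-conditioned detection: free-text class names are embedded with the contrastively pre-trained text encoder and used as queries at inference time, so no detection labels are needed for new categories. The following is a minimal sketch of that usage pattern via the Hugging Face port of OWL-ViT rather than the authors' Scenic release linked above; the checkpoint name, image path, query strings, and score threshold are illustrative assumptions, not values from the paper.

```python
# Sketch: zero-shot text-conditioned detection with the Hugging Face OWL-ViT port.
# The image path and text queries below are placeholders for illustration.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("example.jpg")                      # placeholder image path
queries = [["a photo of a cat", "a photo of a dog"]]   # free-text class queries

# Encode image and queries together; each predicted box is scored against
# every text embedding (open-vocabulary classification).
inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to per-image boxes, scores, and label indices in pixels.
target_sizes = torch.tensor([image.size[::-1]])        # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```

One-shot image-conditioned detection mentioned in the abstract works analogously, with a query image embedding taking the place of the text embedding.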
Year | DOI | Venue
---|---|---
2022 | 10.1007/978-3-031-20080-9_42 | European Conference on Computer Vision

Keywords | DocType | Citations
---|---|---
Open-vocabulary detection, Transformer, Vision transformer, Zero-shot detection, Image-conditioned detection, One-shot object detection, Contrastive learning, Image-text models, Foundation models, CLIP | Conference | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 14

Name | Order | Citations | PageRank
---|---|---|---
Matthias Minderer | 1 | 0 | 0.68
Alexey Gritsenko | 2 | 0 | 0.34
Austin Stone | 3 | 0 | 0.34
Maxim Neumann | 4 | 0 | 0.34
Dirk Weissenborn | 5 | 0 | 0.34
Alexey Dosovitskiy | 6 | 1797 | 80.48
Aravindh Mahendran | 7 | 0 | 0.34
Anurag Arnab | 8 | 0 | 0.34
Mostafa Dehghani | 9 | 0 | 0.34
Zhuoran Shen | 10 | 0 | 0.34
Xiao Wang | 11 | 0 | 0.34
Xiaohua Zhai | 12 | 209 | 13.00
Thomas Kipf | 13 | 0 | 0.34
Neil Houlsby | 14 | 153 | 14.73