Abstract |
---|
Inspired by progress in unsupervised representation learning for natural language, we examine whether similar models can learn useful representations for images. We train a sequence Transformer to auto-regressively predict pixels, without incorporating knowledge of the 2D input structure. Despite training on low-resolution ImageNet without labels, we find that a GPT-2 scale model learns strong image representations as measured by linear probing, fine-tuning, and low-data classification. On CIFAR-10, we achieve 96.3% accuracy with a linear probe, outperforming a supervised Wide ResNet, and 99.0% accuracy with full fine-tuning, matching the top supervised pre-trained models. We are also competitive with self-supervised benchmarks on ImageNet when substituting pixels for a VQVAE encoding, achieving 69.0% top-1 accuracy on a linear probe of our features. |
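The abstract's evaluation protocol of choice is linear probing: the pretrained model is frozen and only a linear classifier is trained on its features. Below is a minimal sketch of that idea, assuming synthetic stand-in features; in iGPT the features would be hidden states of the autoregressive Transformer, and the names here are illustrative, not from the paper's code.

```python
import numpy as np

# Hypothetical frozen features X (in iGPT: Transformer hidden states) and labels y.
# Synthetic data for illustration only.
rng = np.random.default_rng(0)
n, d, k = 200, 16, 2
X = rng.normal(size=(n, d))          # "frozen" features, never updated
w_true = rng.normal(size=(d, k))
y = (X @ w_true).argmax(axis=1)      # labels that are linear in the features

# Linear probe: train only a softmax-regression layer on top of the frozen features.
W = np.zeros((d, k))
for _ in range(500):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(k)[y]
    grad = X.T @ (p - onehot) / n    # gradient of mean cross-entropy w.r.t. W
    W -= 0.5 * grad

acc = ((X @ W).argmax(axis=1) == y).mean()
```

Because the backbone stays fixed, probe accuracy directly measures how linearly separable the learned representation makes the classes, which is why the paper reports it alongside full fine-tuning.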
Year | Venue | DocType |
---|---|---|
2020 | ICML | Conference |
Citations | PageRank | References |
---|---|---|
0 | 0.34 | 0 |
Authors |
---|
7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Mark Chen | 1 | 0 | 1.35 |
Alec Radford | 2 | 2165 | 75.60 |
Rewon Child | 3 | 38 | 3.79 |
Jeffrey K Wu | 4 | 0 | 0.68 |
Heewoo Jun | 5 | 11 | 1.53 |
David Luan | 6 | 0 | 0.34 |
Ilya Sutskever | 7 | 25814 | 1120.24 |