Abstract
---
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset. These assumptions might involve complex architectures, auxiliary losses, or side information such as object part labels or segmentation masks supplied during training. We describe a simple approach for this task based on a transformer that autoregressively models the text and image tokens as a single stream of data. With sufficient data and scale, our approach is competitive with previous domain-specific models when evaluated in a zero-shot fashion.
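The abstract's core idea, modeling text and image tokens autoregressively as a single stream with one transformer, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration under assumed settings, not the authors' released code: the vocabulary sizes, sequence lengths, model dimensions, and the random placeholder tokens are all assumptions made for the example.

```python
# Minimal sketch (assumed hyperparameters, not the paper's actual model):
# text and image token ids share one vocabulary, are concatenated into a
# single sequence, and are modeled with a causally masked transformer.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed codebook sizes
TEXT_LEN, IMAGE_LEN = 256, 1024         # assumed sequence lengths

class TextImageTransformer(nn.Module):
    def __init__(self, d_model=512, n_layers=6, n_heads=8):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB          # one shared vocabulary
        self.embed = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):                    # tokens: (B, T) text ++ image ids
        B, T = tokens.shape
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=mask)             # causal mask -> autoregressive
        return self.head(x)                       # next-token logits

# Training step: shift-by-one cross-entropy over the joint text+image stream.
model = TextImageTransformer()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))                        # placeholder text ids
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, IMAGE_LEN))  # placeholder image ids
stream = torch.cat([text, image], dim=1)          # "a single stream of data"
logits = model(stream[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), stream[:, 1:].reshape(-1))
```

Because the image tokens simply follow the text tokens in the stream, image generation at inference time amounts to ordinary next-token sampling conditioned on a text prefix.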
Year | Venue | DocType
---|---|---
2021 | International Conference on Machine Learning, Vol. 139 | Conference

Volume | ISSN | Citations
---|---|---
139 | 2640-3498 | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 8
Name | Order | Citations | PageRank |
---|---|---|---|
Aditya Ramesh | 1 | 0 | 1.01 |
Mikhail Pavlov | 2 | 0 | 0.34 |
Gabriel Goh | 3 | 0 | 0.34 |
Scott Gray | 4 | 45 | 2.12
Chelsea Voss | 5 | 0 | 0.34 |
Alec Radford | 6 | 2165 | 75.60
Mark Chen | 7 | 0 | 1.35 |
Ilya Sutskever | 8 | 25814 | 1120.24 |