Abstract | ||
---|---|---|
For autoregressive (AR) modeling of high-resolution images, vector quantization (VQ) represents an image as a sequence of discrete codes. A short sequence is important because it reduces the computational cost an AR model incurs when modeling long-range interactions among codes. However, we postulate that previous VQ methods cannot both shorten the code sequence and generate high-fidelity images, owing to the rate-distortion trade-off. In this study, we propose a two-stage framework, consisting of a Residual-Quantized VAE (RQ-VAE) and an RQ-Transformer, to effectively generate high-resolution images. Given a fixed codebook size, RQ-VAE can precisely approximate a feature map of an image and represent the image as a stacked map of discrete codes. RQ-Transformer then learns to predict the quantized feature vector at the next position by predicting the next stack of codes. Thanks to the precise approximation of RQ-VAE, we can represent a $256\times 256$ image as an $8\times 8$ feature map, which lets RQ-Transformer efficiently reduce computational costs. Consequently, our framework outperforms existing AR models on various benchmarks of unconditional and conditional image generation. Our approach also samples significantly faster than previous AR models when generating high-quality images. |
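
The residual quantization idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the random toy codebook, the vector dimension, and the depth of 4 are all hypothetical, and reusing one codebook at every depth is an assumption of this sketch. A zero vector is added to the toy codebook so that the approximation error can never grow from one depth to the next.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def residual_quantize(vec, codebook, depth):
    """Greedily quantize `vec` into `depth` codes.

    At each depth the current residual is mapped to its nearest codebook
    entry, and that entry is subtracted to form the next residual.
    Returns the code indices and the squared error after each depth.
    """
    residual = list(vec)
    codes, errors = [], []
    for _ in range(depth):
        k = min(range(len(codebook)), key=lambda i: sq_dist(residual, codebook[i]))
        codes.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
        errors.append(sum(r ** 2 for r in residual))
    return codes, errors

# Toy setup (all values hypothetical, chosen only for illustration).
rng = random.Random(0)
dim, codebook_size, depth = 4, 16, 4
codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(codebook_size)]
codebook.append([0.0] * dim)  # assumption: zero code keeps the error non-increasing

feature = [rng.gauss(0, 1) for _ in range(dim)]
codes, errors = residual_quantize(feature, codebook, depth)

# An 8 x 8 feature map has 64 positions; each position holds a stack of `depth` codes.
positions = 8 * 8
print("codes per position:", codes)
print("squared error by depth:", errors)
print("positions:", positions, "total codes:", positions * depth)
```

With a depth of 1 this reduces to ordinary VQ: one code per position. Stacking more codes per position refines the approximation without enlarging the codebook, which is what allows the spatial resolution, and hence the number of positions the AR model attends over, to stay small.
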
Field | Value |
---|---|
Year | 2022 |
DOI | 10.1109/CVPR52688.2022.01123 |
Venue | IEEE Conference on Computer Vision and Pattern Recognition |
Keywords | Image and video synthesis and generation |
DocType | Conference |
Volume | 2022 |
Issue | 1 |
Citations | 0 |
PageRank | 0.34 |
References | 0 |
Authors | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Doyup Lee | 1 | 0 | 0.34 |
Chiheon Kim | 2 | 0 | 0.34 |
Saehoon Kim | 3 | 0 | 0.34 |
Minsu Cho | 4 | 677 | 35.74 |
Wook-Shin Han | 5 | 805 | 57.85 |