Abstract
In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity module learns text-image matching by mapping images and text into a common embedding space. The instance-level optimization preserves identity during manipulation. Our model can produce diverse and high-quality images at an unprecedented resolution of 1024². Using a control mechanism based on style mixing, TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method.
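The abstract only sketches the visual-linguistic similarity learning, so the snippet below is a minimal PyTorch illustration of the general idea it describes: two encoders project images and text into a common embedding space, trained so that matched image-text pairs score higher than mismatched ones. The projection layers, feature dimensions, margin, and names (`VisualLinguisticSimilarity`, `matching_loss`) are illustrative assumptions, not the paper's exact architecture or objective.

```python
# A minimal sketch of visual-linguistic similarity learning: project both
# modalities into a shared embedding space and train with a hinge-style
# ranking loss over in-batch pairs. Dimensions and margin are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualLinguisticSimilarity(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, embed_dim=512):
        super().__init__()
        # Hypothetical linear projections into the common embedding space.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, image_feats, text_feats):
        # L2-normalize so the dot product is cosine similarity.
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img @ txt.t()  # (batch, batch) similarity matrix

def matching_loss(sim, margin=0.2):
    # Diagonal entries are matched image-text pairs; off-diagonal entries
    # are mismatched pairs formed within the batch.
    n = sim.size(0)
    pos = sim.diag().unsqueeze(1)  # similarity of each matched pair
    mask = ~torch.eye(n, dtype=torch.bool, device=sim.device)
    # Penalize mismatched pairs that come within `margin` of the match.
    hinge = F.relu(margin + sim - pos)
    return hinge[mask].mean()

# Illustrative usage with random features standing in for real encoders.
model = VisualLinguisticSimilarity()
sim = model(torch.randn(8, 2048), torch.randn(8, 768))
loss = matching_loss(sim)
```

The hinge-style in-batch ranking loss is a common choice for cross-modal matching; the paper's actual training objective may differ.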
Year | DOI | Venue
---|---|---
2021 | 10.1109/CVPR46437.2021.00229 | 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021)

DocType | ISSN | Citations
---|---|---
Conference | 1063-6919 | 2

PageRank | References | Authors
---|---|---
0.35 | 14 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Weihao Xia | 1 | 6 | 2.09 |
Yu-Jiu Yang | 2 | 89 | 19.30
Jing-Hao Xue | 3 | 393 | 46.48 |
Baoyuan Wu | 4 | 267 | 25.15 |