Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation - Citegraph

Paper Info

Title
Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

Abstract
Recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) on a tremendous amount of image-text pair data has shown its superiority on various multimodal alignment tasks. Despite its success, the resulting models are not capable of multimodal generative tasks due to the weak text encoder. To tackle this problem, we propose to augment the dual-stream VLP model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD), enabling the capability for multimodal generation. VLKD is data- and computation-efficient compared to pre-training from scratch. Experimental results show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning. For example, it achieves 44.5% zero-shot accuracy on the VQAv2 dataset, surpassing the previous state-of-the-art zero-shot model with 7x fewer parameters. Furthermore, the original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.

Year	2022
DOI	10.18653/v1/2022.findings-acl.187
Venue	Findings of the Association for Computational Linguistics (ACL 2022)
DocType	Conference
Volume	Findings of the Association for Computational Linguistics: ACL 2022
Citations	0
PageRank	0.34
References	0
Authors	6

Authors (6 rows)

Name	Order	Citations	PageRank
Wenliang Dai	1	0	1.01
Lu Hou	2	62	6.80
Lifeng Shang	3	485	30.96
Xin Jiang	4	150	32.43
Qun Liu	5	2149	203.11
Pascale Fung	6	678	85.84
