XGPT - Cross-modal Generative Pre-Training for Image Captioning. - Citegraph

Paper Info

Title
XGPT - Cross-modal Generative Pre-Training for Image Captioning.

Abstract
While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new method of Cross-modal Generative Pre-Training for Image Captioning that is designed to pre-train text-to-image caption generators through three novel generation tasks, including Image-conditioned Masked Language Modeling (IMLM), Image-conditioned Denoising Autoencoding (IDA), and Text-conditioned Image Feature Generation (TIFG). As a result, the pre-trained XGPT can be fine-tuned without any task-specific architecture modifications to create state-of-the-art models for image captioning. Experiments show that XGPT obtains new state-of-the-art results on the benchmark datasets, including COCO Captions and Flickr30k Captions. We also use XGPT to generate new image captions as data augmentation for the image retrieval task and achieve significant improvement on all recall metrics.

Year	DOI	Venue
2021	10.1007/978-3-030-88480-2_63	NLPCC
DocType	Citations	PageRank
Conference	0	0.34
References	Authors
0	9

Authors (9 rows)

Name	Order	Citations	PageRank
Qiaolin Xia	1	10	2.85
Haoyang Huang	2	1	2.05
Nan Duan	3	213	45.87
Dongdong Zhang	4	241	28.73
Lei Ji	5	0	0.34
Zhifang Sui	6	172	39.06
Edward Cui	7	0	0.34
Taroon Bharti	8	0	0.34
Ming Zhou	9	4262	251.74
