Title
Image Captioning With Dense Fusion Connection And Improved Stacked Attention Module
Abstract
In existing image captioning methods, masked convolution is often used to generate the language description, and the traditional residual network (ResNet) connections used with masked convolution suffer from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet method of combining features through summation. The improved stacked attention module captures more fine-grained visual information that is highly relevant to word prediction. Finally, we employ a Transformer as the image encoder to fully obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
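The abstract's core idea, DenseNet-style dense connectivity fused with a ResNet-style summation around a masked (causal) convolution decoder, can be sketched in a few lines. The PyTorch block below is a minimal illustration only: the class name DFCBlock, the channel sizes, the number of inner layers, and the causal 1-D convolution standing in for masked convolution are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a dense-fusion-connection (DFC) style block.
# Hypothetical shapes and layer counts; not the paper's actual code.
import torch
import torch.nn as nn

class DFCBlock(nn.Module):
    """DenseNet-style concatenation of all earlier feature maps,
    followed by a ResNet-style residual summation."""
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each layer sees the concatenation of the block input and
            # every preceding layer's output (dense connectivity).
            in_ch = channels * (i + 1)
            self.layers.append(nn.Sequential(
                nn.ConstantPad1d((2, 0), 0.0),  # left-pad: causal ("masked") conv
                nn.Conv1d(in_ch, channels, kernel_size=3),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # dense concatenation
            features.append(out)
        # ResNet-style fusion: sum the block output with its input so the
        # gradient has a short path through the block, easing vanishing gradients.
        return x + features[-1]

if __name__ == "__main__":
    block = DFCBlock(channels=64)
    words = torch.randn(2, 64, 20)  # (batch, channels, sequence length)
    print(block(words).shape)       # torch.Size([2, 64, 20])
```

The residual summation at the end is what distinguishes this fusion from plain DenseNet concatenation: features are first densely aggregated, then merged with the input by addition, as the abstract describes.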
Year
2021
DOI
10.1007/s11063-021-10431-y
Venue
NEURAL PROCESSING LETTERS
Keywords
Image captioning, Masked convolution, Dense fusion connection, Improved stacked attention module
DocType
Journal
Volume
53
Issue
2
ISSN
1370-4621
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order  Citations  PageRank
Hegui Zhu       1      46         5.73
Ru Wang         2      0          0.34
Xiangde Zhang   3      91         15.32