Title
Image Captioning With Dense Fusion Connection And Improved Stacked Attention Module
Abstract
In existing image captioning methods, masked convolution is often used to generate the language description, and the traditional residual network (ResNet) connections used with masked convolution suffer from the vanishing gradient problem. To address this issue, we propose a new image captioning framework that combines a dense fusion connection (DFC) with an improved stacked attention module. DFC uses the dense convolutional network (DenseNet) architecture to connect each layer to every other layer in a feed-forward fashion, and then adopts the ResNet method of combining features through summation. The improved stacked attention module captures more fine-grained visual information that is highly relevant to word prediction. Finally, we employ a Transformer as the image encoder to fully obtain the attended image representation. Experimental results on the MS-COCO dataset demonstrate that the proposed model increases the CIDEr score from 91.2% to 106.1%, outperforming comparable models and verifying its effectiveness.
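The abstract's core idea, DenseNet-style dense connectivity fused with a ResNet-style summation around a masked (causal) convolution decoder, can be sketched in a few lines. The PyTorch block below is a minimal illustration only: the class name DFCBlock, the channel sizes, the number of inner layers, and the causal 1-D convolution standing in for masked convolution are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a dense-fusion-connection (DFC) style block.
# Hypothetical shapes and layer counts; not the paper's actual code.
import torch
import torch.nn as nn

class DFCBlock(nn.Module):
    """DenseNet-style concatenation of all earlier feature maps,
    followed by a ResNet-style residual summation."""
    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            # Each layer sees the concatenation of the block input and
            # every preceding layer's output (dense connectivity).
            in_ch = channels * (i + 1)
            self.layers.append(nn.Sequential(
                nn.ConstantPad1d((2, 0), 0.0),  # left-pad: causal ("masked") conv
                nn.Conv1d(in_ch, channels, kernel_size=3),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # dense concatenation
            features.append(out)
        # ResNet-style fusion: sum the block output with its input so the
        # gradient has a short path through the block, easing vanishing gradients.
        return x + features[-1]

if __name__ == "__main__":
    block = DFCBlock(channels=64)
    words = torch.randn(2, 64, 20)  # (batch, channels, sequence length)
    print(block(words).shape)       # torch.Size([2, 64, 20])
```

The residual summation at the end is what distinguishes this fusion from plain DenseNet concatenation: features are first densely aggregated, then merged with the input by addition, as the abstract describes.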
Year
2021
DOI
10.1007/s11063-021-10431-y
Venue
NEURAL PROCESSING LETTERS
Keywords
Image captioning, Masked convolution, Dense fusion connection, Improved stacked attention module
DocType
Journal
Volume
53
Issue
2
ISSN
1370-4621
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order  Citations  PageRank
Hegui Zhu       1      46         5.73
Ru Wang         2      0          0.34
Xiangde Zhang   3      91         15.32