Title
CASCADE ATTENTION FUSION FOR FINE-GRAINED IMAGE CAPTIONING BASED ON MULTI-LAYER LSTM
Abstract
Conventional visual attention-based image captioning approaches typically use image information to guide caption generation. Captions produced by these models tend to be coarse and to overlook details in the image, such as objects, attributes and the distinguishing aspects of each image. In this paper, we propose a visual and semantic fusion network with a margin-based training guidance mechanism to generate fine image descriptions that depict more objects, attributes and other distinguishing aspects of images. In our model, the visual attention layer introduces more low-level visual information, while the semantic attention layer provides more high-level semantic attributes. Furthermore, the proposed margin-based loss encourages our model to produce more discriminative descriptions. Extensive experiments on the MS-COCO and Flickr30K image captioning datasets validate our method and show its superior captioning performance. Our method achieves a state-of-the-art 70.6 CIDEr-D on the Flickr30K dataset and a competitive 123.5 CIDEr-D on the MS-COCO dataset.
Year
2021
DOI
10.1109/ICASSP39728.2021.9413691
Venue
2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords
Fine descriptions, Attention mechanism, Multi-layer LSTM, Visual and semantic fusion network
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
7
Name            Order  Citations  PageRank
Shuang Wang     1      8          2.02
Y Meng          2      0          0.34
Yanfeng Gu      3      742        55.56
Lifeng Zhang    4      5          2.86
Xiongying Ye    5      10         4.50
Jia-Wei Tian    6      12         1.30
Licheng Jiao    7      5698       475.84