Paper Info

Title
CASCADE ATTENTION FUSION FOR FINE-GRAINED IMAGE CAPTIONING BASED ON MULTI-LAYER LSTM

Abstract
Conventional visual attention-based image captioning approaches typically use image information to guide caption generation. Results from these models tend to be coarse and ignore the details in the image, such as objects, attributes and the distinguishing aspects of each image. In this paper, we propose a visual and semantic fusion network with a margin-based training guidance mechanism to generate fine image descriptions that depict more objects, attributes and other distinguishing aspects of images. In our model, the visual attention layer introduces more low-level visual information, while the semantic attention layer provides more high-level semantic attributes. Furthermore, the proposed margin-based loss encourages our model to produce more discriminative descriptions. Extensive experiments are conducted on the COCO and Flickr30K image captioning datasets to validate our method, and the results show its superior captioning performance. Our method achieves a state-of-the-art 70.6 CIDEr-D on the Flickr30K dataset, and a competitive 123.5 CIDEr-D on the MS-COCO dataset.

Year	2021
DOI	10.1109/ICASSP39728.2021.9413691
Venue	2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021)
Keywords	Fine descriptions, Attention mechanism, Multi-layer LSTM, Visual and semantic fusion network
DocType	Conference
Citations	0
PageRank	0.34
References	0
Authors	7

Authors (7 rows)

Name	Order	Citations	PageRank
Shuang Wang	1	8	2.02
Y Meng	2	0	0.34
Yanfeng Gu	3	742	55.56
Lifeng Zhang	4	5	2.86
Xiongying Ye	5	10	4.50
Jia-Wei Tian	6	12	1.30
Licheng Jiao	7	5698	475.84
