Abstract |
---|
Image captioning is a complex multi-modal task in machine learning. Traditional methods focus only on the entities detected by the visual policy network and cannot reason about the relationships between entities and their attributes, while the language policy network suffers from exposure bias and error accumulation. To this end, this paper proposes a multi-level visual fusion network model based on reinforcement learning. In the visual policy network, multi-level neural network modules transform visual features into a feature set of visual knowledge. The fusion network generates the function words that make descriptions more fluent, and mediates the interaction between the visual policy network and the language policy network. In the language policy network, a self-critical policy gradient algorithm based on reinforcement learning achieves end-to-end optimization of the visual fusion network. We evaluated our model on the Flickr30K and MS-COCO datasets; experiments verify both the accuracy of the model and the diversity of the captions it learns to generate. Our model achieves better performance than state-of-the-art methods. |
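The self-critical policy gradient the abstract refers to rewards sampled captions by how much they beat the model's own greedy decode, which serves as the baseline. A minimal sketch of that loss, in plain Python with hypothetical argument names (the paper's exact formulation and reward metric, e.g. CIDEr, are not given here):

```python
def scst_loss(sample_logprobs, sample_rewards, greedy_rewards):
    """Self-critical policy-gradient loss (sketch).

    sample_logprobs: log-probability of each sampled caption under the model
    sample_rewards:  sequence-level reward (e.g. CIDEr) of each sampled caption
    greedy_rewards:  reward of the greedily decoded caption, used as baseline
    """
    # Advantage of each sampled caption over the greedy baseline:
    # only samples that beat the model's own test-time decode get pushed up.
    advantages = [rs - rg for rs, rg in zip(sample_rewards, greedy_rewards)]
    # REINFORCE: maximize E[advantage * log p(sample)], so the loss is the
    # negated mean of advantage-weighted log-probabilities.
    n = len(sample_logprobs)
    return -sum(a * lp for a, lp in zip(advantages, sample_logprobs)) / n
```

Because the baseline comes from the model itself, no separate value network is needed, and the gradient is zero whenever sampling does no better than greedy decoding.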
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/IJCNN48605.2020.9206932 | 2020 International Joint Conference on Neural Networks (IJCNN) |
Keywords | DocType | ISSN
---|---|---|
Visualization, Feature extraction, Task analysis, Reinforcement learning, Object detection, Training, Decoding | Conference | 2161-4393
ISBN | Citations | PageRank
---|---|---|
978-1-7281-6926-2 | 0 | 0.34
References | Authors
---|---|
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Dongming Zhou | 1 | 2 | 3.40 |
Canlong Zhang | 2 | 5 | 8.55 |
Zhixin Li | 3 | 12 | 19.62 |
Zhiwen Wang | 4 | 0 | 0.34 |