Title
Multi-level Visual Fusion Networks for Image Captioning
Abstract
Image captioning is a complex multi-modal task in machine learning. Traditional methods focus only on entities in the visual policy network and cannot reason about the relationships between entities and their attributes, while the language policy network suffers from exposure bias and error accumulation. To this end, this paper proposes a multi-level visual fusion network model based on reinforcement learning. In the visual policy network, multi-level neural network modules transform visual features into a feature set of visual knowledge. The fusion network generates the function words that make descriptions more fluent and mediates the interaction between the visual policy network and the language policy network. In the language policy network, a self-critical policy-gradient algorithm based on reinforcement learning achieves end-to-end optimization of the visual fusion network. We evaluate our model on the Flickr30K and MS-COCO datasets, and experiments verify both the accuracy of the model and the diversity of the captions it learns. Our model achieves better performance than state-of-the-art methods.
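The self-critical training the abstract mentions can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes the standard self-critical setup, where the reward of a greedily decoded caption serves as the baseline for the reward of a sampled caption, and the per-caption sum of token log-probabilities is already available:

```python
import numpy as np

def scst_loss(log_probs, sampled_rewards, greedy_rewards):
    """Self-critical policy-gradient loss (sketch).

    log_probs       : per-caption sum of token log-probabilities of the
                      sampled captions, shape (batch,)
    sampled_rewards : e.g. CIDEr score of each sampled caption
    greedy_rewards  : CIDEr score of the greedy decode, used as baseline
    """
    # Captions that beat the greedy baseline get positive advantage.
    advantage = sampled_rewards - greedy_rewards
    # Negative sign: minimizing this loss maximizes expected reward.
    return float(np.mean(-advantage * log_probs))

loss = scst_loss(np.array([-2.0, -3.0]),
                 np.array([0.8, 0.4]),
                 np.array([0.5, 0.5]))
# advantages are [0.3, -0.1], so the first caption is reinforced
```

Because the greedy decode is the same model run at test time, this baseline directly optimizes the metric used at inference, which is how exposure bias is mitigated in self-critical training.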
Year
DOI
Venue
2020
10.1109/IJCNN48605.2020.9206932
2020 International Joint Conference on Neural Networks (IJCNN)
Keywords
Visualization,Feature extraction,Task analysis,Reinforcement learning,Object detection,Training,Decoding
DocType
Conference
ISSN
2161-4393
ISBN
978-1-7281-6926-2
Citations
0
PageRank
0.34
References
0
Authors
4
Name | Order | Citations | PageRank
Dongming Zhou | 1 | 2 | 3.40
Canlong Zhang | 2 | 5 | 8.55
Zhixin Li | 3 | 12 | 19.62
Zhiwen Wang | 4 | 0 | 0.34