Abstract |
---|
Image captioning is a complex multi-modal task in machine learning. Traditional methods focus only on the entities detected by the visual policy network and cannot reason about the relationships between entities and their attributes, while the language policy network suffers from exposure bias and error accumulation. To this end, this paper proposes a multi-level visual fusion network model based on reinforcement learning. In the visual policy network, multi-level neural network modules transform visual features into a feature set of visual knowledge. The fusion network generates the function words that make descriptions more fluent, and mediates the interaction between the visual policy network and the language policy network. In the language policy network, a self-critical policy gradient algorithm based on reinforcement learning achieves end-to-end optimization of the visual fusion network. We evaluated our model on the Flickr30K and MS-COCO datasets; experiments verify both the accuracy of the model and the diversity of the captions it learns to generate. Our model achieves better performance than state-of-the-art methods. |
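The self-critical policy gradient the abstract refers to rewards sampled captions by how much they beat the model's own greedy decode, which serves as the baseline. A minimal sketch of that loss, in plain Python with hypothetical argument names (the paper's exact formulation and reward metric, e.g. CIDEr, are not given here):

```python
def scst_loss(sample_logprobs, sample_rewards, greedy_rewards):
    """Self-critical policy-gradient loss (sketch).

    sample_logprobs: log-probability of each sampled caption under the model
    sample_rewards:  sequence-level reward (e.g. CIDEr) of each sampled caption
    greedy_rewards:  reward of the greedily decoded caption, used as baseline
    """
    # Advantage of each sampled caption over the greedy baseline:
    # only samples that beat the model's own test-time decode get pushed up.
    advantages = [rs - rg for rs, rg in zip(sample_rewards, greedy_rewards)]
    # REINFORCE: maximize E[advantage * log p(sample)], so the loss is the
    # negated mean of advantage-weighted log-probabilities.
    n = len(sample_logprobs)
    return -sum(a * lp for a, lp in zip(advantages, sample_logprobs)) / n
```

Because the baseline comes from the model itself, no separate value network is needed, and the gradient is zero whenever sampling does no better than greedy decoding.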
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/IJCNN48605.2020.9206932 | 2020 International Joint Conference on Neural Networks (IJCNN) |
Keywords | DocType | ISSN
---|---|---|
Visualization, Feature extraction, Task analysis, Reinforcement learning, Object detection, Training, Decoding | Conference | 2161-4393
ISBN | Citations | PageRank
---|---|---|
978-1-7281-6926-2 | 0 | 0.34
References | Authors
---|---|
0 | 4
Name | Order | Citations | PageRank |
---|---|---|---|
Dongming Zhou | 1 | 2 | 3.40 |
Canlong Zhang | 2 | 5 | 8.55 |
Zhixin Li | 3 | 12 | 19.62 |
Zhiwen Wang | 4 | 0 | 0.34 |