Abstract |
---|
Most existing image captioning models mainly use global attention, which represents whole-image features, local attention, which represents object features, or a combination of the two; few models integrate the relationship information between the various object regions of an image. Yet this relationship information is also very instructive for caption generation: for example, if a football appears, there is a high probability that the image also contains people near the football. In this article, the relationship feature is embedded into global-local attention to construct a new Pyramid Attention mechanism, which can explore the internal visual and semantic relationships between different object regions. In addition, to alleviate the exposure bias problem and make training more efficient, we propose a new method to apply the Generative Adversarial Network to sequence generation. Greedy decoding is used to generate an efficient baseline reward for self-critical training. Finally, experiments on the MSCOCO dataset show that the model can generate more accurate and vivid captions and outperforms many recent advanced models on various prevailing evaluation metrics on both the local and online test sets. |
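The abstract mentions using greedy decoding to provide the baseline reward for self-critical training. As a point of reference only, the sketch below illustrates the general self-critical (SCST-style) advantage computation that this describes; it is not the paper's implementation, and the function name `self_critical_loss`, the tensor shapes, and the toy reward values are assumptions for illustration.

```python
import torch

def self_critical_loss(sample_logprobs, sample_rewards, greedy_rewards):
    """Policy-gradient loss with a greedy-decoding baseline (SCST-style sketch).

    sample_logprobs: (batch, seq_len) log-probabilities of the sampled caption tokens
    sample_rewards:  (batch,) sentence-level reward (e.g. CIDEr) of the sampled captions
    greedy_rewards:  (batch,) reward of the greedy-decoded captions, used as the baseline
    """
    # Advantage = reward of the sampled caption minus the greedy baseline reward.
    advantage = (sample_rewards - greedy_rewards).unsqueeze(1)  # (batch, 1)
    # Maximizing expected reward == minimizing negative advantage-weighted log-likelihood.
    return -(advantage.detach() * sample_logprobs).mean()

# Toy usage with made-up numbers (real rewards would come from a caption scorer such as CIDEr).
logp = torch.log(torch.tensor([[0.7, 0.6, 0.8], [0.5, 0.4, 0.9]]))
loss = self_critical_loss(logp, torch.tensor([1.2, 0.8]), torch.tensor([1.0, 1.0]))
print(loss.item())
```

Captions whose reward exceeds the greedy baseline receive a positive advantage and are reinforced, while worse-than-greedy samples are penalized, which is what removes the need for a separately learned reward baseline.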
Year | DOI | Venue |
---|---|---|
2022 | 10.1016/j.imavis.2021.104340 | Image and Vision Computing |
Keywords | DocType | Volume
---|---|---
Image captioning, Pyramid Attention network, Self-critical training, Reinforcement learning, Generative adversarial network, Sequence-level learning | Journal | 117

ISSN | Citations | PageRank
---|---|---
0262-8856 | 0 | 0.34

References | Authors
---|---
0 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Tianyu Chen | 1 | 0 | 0.34 |
Zhixin Li | 2 | 0 | 0.68 |
Jingli Wu | 3 | 3 | 3.15 |
Huifang Ma | 4 | 0 | 1.35 |
Bianping Su | 5 | 0 | 0.34 |