Title
Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer
Abstract
Remote sensing image captioning (RSIC), which describes image content in natural language, is of great significance for image understanding. Existing methods are mainly based on deep learning and rely on an encoder-decoder model to generate sentences; during decoding, recurrent neural networks (RNNs) and long short-term memory (LSTM) networks are typically applied to generate captions sequentially. In this letter, a transformer encoder-decoder is combined with grid features to improve RSIC performance. First, a pretrained convolutional neural network (CNN) is used to extract grid-based visual features, which are encoded as vectorial representations. Then, the transformer outputs semantic descriptions that bridge visual features and natural language. In addition, the self-critical sequence training (SCST) strategy is applied to further optimize the captioning model and improve the quality of generated sentences. Extensive experiments are conducted on three public datasets: RSICD, UCM-Captions, and Sydney-Captions. Experimental results demonstrate the effectiveness of the SCST strategy, and the proposed method achieves superior performance compared with state-of-the-art image captioning approaches on the RSICD dataset.
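The SCST strategy mentioned in the abstract can be sketched as a REINFORCE-style objective in which the reward of a sampled caption is baselined by the reward of the greedily decoded caption. The function name, the toy log-probabilities, and the reward values below are illustrative assumptions, not details from the paper; in practice the reward would be a sentence-level metric such as CIDEr.

```python
# Minimal sketch of the self-critical sequence training (SCST) objective.
# All names and numbers are illustrative; the reward function, model, and
# decoding procedure from the paper are not reproduced here.
import math


def scst_loss(sampled_logprobs, sampled_reward, greedy_reward):
    """REINFORCE loss using the greedy-decoding reward as the baseline.

    sampled_logprobs: per-token log-probabilities of the sampled caption.
    sampled_reward:   sentence-level reward of the sampled caption.
    greedy_reward:    reward of the greedily decoded caption (the baseline).
    """
    advantage = sampled_reward - greedy_reward
    # Minimizing this loss raises the probability of captions that score
    # better than the greedy baseline and lowers it otherwise.
    return -advantage * sum(sampled_logprobs)


# Toy example: a two-token sampled caption that outscores the greedy one.
logps = [math.log(0.5), math.log(0.25)]
loss = scst_loss(logps, sampled_reward=0.8, greedy_reward=0.6)
```

When the sampled and greedy captions score equally, the advantage is zero and the loss vanishes, which is what keeps the gradient estimate low-variance without a learned baseline.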
Year
2022
DOI
10.1109/LGRS.2021.3135711
Venue
IEEE GEOSCIENCE AND REMOTE SENSING LETTERS
Keywords
Feature extraction, Transformers, Decoding, Visualization, Training, Measurement, Semantics, Convolutional neural network (CNN), image captioning, remote sensing, transformer
DocType
Journal
Volume
19
ISSN
1545-598X
Citations
0
PageRank
0.34
References
0
Authors
6
Name           Order  Citations  PageRank
Shuo Zhuang    1      0          0.68
ping wang      2      104        17.46
Gang Wang      3      2869       135.49
Di Wang        4      1337       143.48
Jinyong Chen   5      1          1.03
Feng Gao       6      182        45.20