Title
Using Neural Encoder-Decoder Models With Continuous Outputs for Remote Sensing Image Captioning
Abstract
Remote sensing image captioning involves generating a concise textual description for an input aerial image. The task has received significant attention, and several recent proposals are based on neural encoder-decoder models. Most previous methods are trained to generate discrete outputs, corresponding to word tokens that match the reference sentences word by word, thereby optimizing generation locally at the token level instead of globally at the sentence level. This paper explores an alternative generation method based on continuous outputs, which produces sequences of embedding vectors instead of directly predicting word tokens. We argue that continuous output models have the potential to better capture the global semantic similarity between captions and images, e.g., by facilitating the use of loss functions that match different views of the data. These include comparing representations for individual tokens and for entire captions, as well as comparing captions against intermediate image representations. We experimentally compare discrete versus continuous output captioning methods on the UCM and RSICD datasets, which are extensively used in the area despite some issues that we also discuss. Results show that the alternative encoder-decoder framework with continuous outputs can indeed lead to better results on the two datasets, compared to the standard approach based on discrete outputs. The proposed approach is also competitive against the state-of-the-art model in the area.
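The core idea of continuous-output generation, regressing embedding vectors and recovering discrete words by nearest-neighbor search against a word-embedding table, can be sketched as follows. This is a minimal hypothetical illustration with toy sizes and random data, using a simple GRU decoder and a token-level cosine loss; it is not the authors' exact architecture, and all names and dimensions here are assumptions.

```python
# Minimal sketch of continuous-output decoding for captioning, in PyTorch.
# Instead of a softmax over the vocabulary, the decoder regresses an
# embedding vector at each step; discrete words are recovered afterwards by
# nearest-neighbor search in a fixed word-embedding table. Toy sizes and
# random data throughout; illustrative only, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # assumed toy dimensions

# Fixed target word embeddings (in practice, pretrained vectors); frozen.
word_embeddings = F.normalize(torch.randn(vocab_size, embed_dim), dim=-1)


class ContinuousDecoder(nn.Module):
    """GRU decoder whose output head produces embeddings, not logits."""

    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, embed_dim)  # continuous output head

    def forward(self, prev_embeds, hidden=None):
        out, hidden = self.rnn(prev_embeds, hidden)
        # Normalize so predictions live on the same unit sphere as targets.
        return F.normalize(self.proj(out), dim=-1), hidden


def cosine_loss(pred, target_ids):
    """Token-level loss: cosine distance to the gold word embeddings.
    Sentence-level variants would instead compare pooled caption vectors,
    or caption vectors against intermediate image representations."""
    target = word_embeddings[target_ids]
    return (1.0 - F.cosine_similarity(pred, target, dim=-1)).mean()


def nearest_words(pred):
    """Decode by nearest neighbor in the embedding table."""
    return (pred @ word_embeddings.T).argmax(dim=-1)


# Toy usage: batch of 2 captions, 5 steps each, random previous words.
decoder = ContinuousDecoder()
prev = word_embeddings[torch.randint(0, vocab_size, (2, 5))]
targets = torch.randint(0, vocab_size, (2, 5))
preds, _ = decoder(prev)
loss = cosine_loss(preds, targets)
loss.backward()
print(loss.item(), nearest_words(preds).shape)  # scalar loss, shape (2, 5)
```

In this sketch the output head is a plain linear projection trained by regression, so the expensive vocabulary-sized softmax disappears and the loss can be computed directly in embedding space, which is what enables comparing different views of the data as the abstract describes.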
Year
2022
DOI
10.1109/ACCESS.2022.3151874
Venue
IEEE ACCESS
Keywords
Remote sensing, Decoding, Standards, Task analysis, Semantics, Training, Predictive models, Deep neural networks, natural language generation, remote sensing image captioning
DocType
Journal
Volume
10
ISSN
2169-3536
Citations
0
PageRank
0.34
References
0
Authors
2
Name            Order   Citations   PageRank
Rita Ramos      1       0           0.34
Bruno Martins   2       441         34.58