Title
Be Specific, Be Clear: Bridging Machine and Human Captions by Scene-Guided Transformer
Abstract
Automatically generating natural language descriptions for images, i.e., image captioning, is one of the primary goals of multimedia understanding. The recent success of deep neural networks in image captioning has been accompanied by region-based bottom-up attention features. Region-based features represent the contents of local regions but lack an overall understanding of the image, which is critical for more specific and clear language expression. Visual scene perception can facilitate this overall understanding and provide prior knowledge for generating specific and clear captions of objects, object relations, and overall image scenes. In this paper, we propose a Scene-Guided Transformer (SG-Transformer) model that leverages scene-level global context to generate more specific and descriptive image captions. SG-Transformer adopts an encoder-decoder architecture. The encoder aggregates global scene context as external knowledge with object region-based features in attention learning to facilitate object relation reasoning, and it incorporates high-level auxiliary scene-guided tasks for more specific visual representation learning. The decoder then integrates both the object-level and scene-level information refined by the encoder for an overall image perception. Extensive experiments on the MSCOCO and Flickr30k benchmarks show the superiority and generality of SG-Transformer. Moreover, the proposed scene-guided approach can enrich object-level and scene-graph visual representations in the encoder and generalizes to both RNN- and Transformer-based decoders.
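To make the encoder idea in the abstract concrete (scene-level global context aggregated with object region features during attention learning), here is a minimal, illustrative sketch. It is not the authors' implementation: the class name SceneGuidedEncoderLayer, the argument scene_ctx, and the 512-dimensional / 36-region shapes are all hypothetical, and the real SG-Transformer additionally uses auxiliary scene-guided objectives and a decoder that fuses both streams.

```python
# Hypothetical sketch of a scene-guided encoder layer: object-region features
# attend over a sequence augmented with one global scene-context "token".
import torch
import torch.nn as nn


class SceneGuidedEncoderLayer(nn.Module):
    """One encoder layer: self-attention over region features plus a scene token."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, regions: torch.Tensor, scene_ctx: torch.Tensor):
        # regions:   (batch, num_regions, d_model) bottom-up region features
        # scene_ctx: (batch, d_model)              global scene embedding
        # Prepend the scene context as an extra token so every region can
        # attend to scene-level global context while reasoning about relations.
        seq = torch.cat([scene_ctx.unsqueeze(1), regions], dim=1)
        attended, _ = self.attn(seq, seq, seq)
        seq = self.norm1(seq + attended)
        seq = self.norm2(seq + self.ffn(seq))
        # Split the refined scene token and region features back apart so a
        # decoder could integrate both object-level and scene-level information.
        return seq[:, 0], seq[:, 1:]


if __name__ == "__main__":
    layer = SceneGuidedEncoderLayer()
    regions = torch.randn(2, 36, 512)   # e.g. 36 bottom-up-attention regions
    scene = torch.randn(2, 512)         # e.g. output of a scene classifier
    scene_out, region_out = layer(regions, scene)
    print(scene_out.shape, region_out.shape)  # (2, 512) and (2, 36, 512)
```

Treating the scene context as an extra attention token is only one plausible reading of "aggregates global scene context ... in attention learning"; the paper itself should be consulted for the exact fusion mechanism.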
Year
2021
DOI
10.1145/3463945.3469054
Venue
International Multimedia Conference
Keywords
image captioning, scene, context
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
3
Name            Order  Citations  PageRank
Yupan Huang     1      0          1.35
Zhaoyang Zeng   2      1          2.06
Yutong Lu       3      307        53.61