Abstract |
---|
Image captioning has received significant attention, with remarkable improvements from recent advances. Nevertheless, images in the wild encapsulate rich knowledge and cannot be sufficiently described by models built on image-caption pairs containing only in-domain objects. In this paper, we propose to address this problem by augmenting standard deep captioning architectures with object learners. Specifically, we present Long Short-Term Memory with Pointing (LSTM-P), a new architecture that facilitates vocabulary expansion and produces novel objects via a pointing mechanism. Technically, object learners are first pre-trained on available object recognition data. At each time step of decoding, the pointing mechanism in LSTM-P then balances the probability between generating a word through the LSTM and copying a word from the recognized objects. Furthermore, our captioning model encourages global coverage of objects in the sentence. Extensive experiments on both the held-out COCO image captioning dataset and ImageNet for describing novel objects show superior results compared to state-of-the-art approaches. More remarkably, we obtain an average F1 score of 60.9% on the held-out COCO dataset. |
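The copy-vs-generate step described in the abstract can be sketched as follows. This is a minimal illustrative sketch in PyTorch under stated assumptions, not the authors' implementation: the class name `PointingDecoder`, the single-layer attention over recognized objects, and all tensor shapes are hypothetical, and the global coverage objective mentioned in the abstract is omitted.

```python
import torch
import torch.nn as nn

class PointingDecoder(nn.Module):
    # Minimal sketch of one copy-vs-generate decoding step (illustrative only).
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lstm_cell = nn.LSTMCell(hidden_size, hidden_size)
        self.vocab_proj = nn.Linear(hidden_size, vocab_size)  # generation head over vocabulary
        self.attn = nn.Linear(hidden_size, hidden_size)       # scores recognized objects
        self.copy_gate = nn.Linear(hidden_size, 1)            # balances copying vs. generating

    def forward(self, x_t, state, obj_feats, obj_vocab_ids):
        # x_t: (B, H) input embedding for this time step
        # obj_feats: (B, K, H) features of objects found by the pre-trained learners
        # obj_vocab_ids: (B, K) vocabulary indices of those object words
        h, c = self.lstm_cell(x_t, state)

        # Generation distribution over the full vocabulary.
        p_gen = torch.softmax(self.vocab_proj(h), dim=-1)  # (B, V)

        # Copy distribution: attend over recognized objects, then scatter the
        # attention weights onto the objects' vocabulary indices.
        scores = torch.bmm(obj_feats, self.attn(h).unsqueeze(-1)).squeeze(-1)  # (B, K)
        alpha = torch.softmax(scores, dim=-1)
        p_copy = torch.zeros_like(p_gen).scatter_add(1, obj_vocab_ids, alpha)  # (B, V)

        # Gate deciding, at this time step, how much probability mass
        # goes to copying a recognized object word versus generating.
        g = torch.sigmoid(self.copy_gate(h))  # (B, 1)
        p_word = g * p_copy + (1.0 - g) * p_gen
        return p_word, (h, c)
```

In training, `p_word` would feed a standard cross-entropy caption loss; because copied object words come from recognizers trained on broader object recognition data, the output vocabulary can include objects absent from the image-caption pairs.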
Year | DOI | Venue |
---|---|---|
2019 | 10.1109/CVPR.2019.01278 | 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2019)
Field | DocType | Volume
---|---|---|
F1 score, Architecture, Closed captioning, Pattern recognition, Computer science, Copying, Speech recognition, Artificial intelligence, Sentence, Vocabulary, Cognitive neuroscience of visual object recognition | Journal | abs/1904.11251
ISSN | Citations | PageRank
---|---|---|
1063-6919 | 4 | 0.41
References | Authors
---|---|
0 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Yehao Li | 1 | 75 | 8.57 |
Ting Yao | 2 | 842 | 52.62 |
Yingwei Pan | 3 | 357 | 23.66 |
Hongyang Chao | 4 | 495 | 36.96 |
Tao Mei | 5 | 4702 | 288.54 |