Abstract |
---|
Image captioning models are usually evaluated on their ability to describe a held-out set of images, not on their ability to generalize to unseen concepts. We study the problem of compositional generalization, which measures how well a model composes unseen combinations of concepts when describing images. State-of-the-art image captioning models show poor generalization performance on this task. To address this, we propose a multi-task model that combines caption generation and image-sentence ranking, and that uses a decoding mechanism that re-ranks the captions according to their similarity to the image. This model is substantially better at generalizing to unseen combinations of concepts than state-of-the-art captioning models. |
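
The re-ranking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the image and each candidate caption have already been embedded into a shared vector space, and that cosine similarity is the similarity measure; the abstract does not specify either detail. The names `cosine_similarity` and `rerank_captions`, and the toy vectors, are hypothetical and not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_captions(
    image_embedding: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Sort candidate captions by similarity to the image, most similar first.

    `candidates` pairs each caption string with its (assumed precomputed)
    embedding in the same space as `image_embedding`.
    """
    scored = [
        (caption, cosine_similarity(image_embedding, embedding))
        for caption, embedding in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy example: hypothetical 2-d embeddings, for illustration only.
image = [1.0, 0.0]
beam = [
    ("a dog on a bench", [0.0, 1.0]),
    ("a white dog on a bench", [1.0, 0.0]),
]
best_caption, best_score = rerank_captions(image, beam)[0]
```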
Year | DOI | Venue |
---|---|---|
2019 | 10.18653/v1/k19-1009 | 2986729057 |
Field | DocType | Citations |
Closed captioning, Computer science, Artificial intelligence, Natural language processing | Conference | 0 |
PageRank | References | Authors |
0.34 | 0 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Mitja Nikolaus | 1 | 0 | 1.01 |
Mostafa Abdou | 2 | 0 | 4.73 |
Matthew Lamm | 3 | 26 | 4.82 |
Rahul Aralikatte | 4 | 2 | 2.74 |
Desmond Elliott | 5 | 309 | 24.91 |