Abstract |
---|
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by the use of hard negatives in structured prediction, and ranking loss functions used in retrieval, we introduce a simple change to common loss functions used to learn multi-modal embeddings. That, combined with fine-tuning and the use of augmented data, yields significant gains in retrieval performance. We showcase our approach, dubbed VSE++, on the MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (based on R@1). |
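
The "simple change" the abstract refers to is replacing the usual sum of hinge penalties over all negatives with a penalty on only the hardest in-batch negative. Below is a minimal PyTorch sketch of such a hard-negative triplet ranking loss; it assumes L2-normalized embeddings scored by dot product (cosine similarity) and in-batch negative mining, and the function name and margin value are illustrative, not taken from the paper.

```python
import torch


def vse_max_hinge_loss(im, cap, margin=0.2):
    """Triplet ranking loss using the hardest in-batch negative.

    im, cap: L2-normalized image and caption embeddings, shape (B, D);
    row i of `im` matches row i of `cap`. (Sketch; margin is assumed.)
    """
    scores = im @ cap.t()                # (B, B) pairwise similarities
    pos = scores.diag().view(-1, 1)      # matched-pair scores

    # Hinge cost of every in-batch negative against its positive pair.
    cost_cap = (margin + scores - pos).clamp(min=0)      # negative captions per image
    cost_im = (margin + scores - pos.t()).clamp(min=0)   # negative images per caption

    # Zero the diagonal so positives are never treated as negatives.
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0.0)
    cost_im = cost_im.masked_fill(mask, 0.0)

    # Key difference from the sum-of-hinges loss: keep only the hardest
    # (maximum-cost) negative in each row/column instead of summing them all.
    return cost_cap.max(dim=1)[0].sum() + cost_im.max(dim=0)[0].sum()
```

Because the embeddings are normalized, the dot product already equals cosine similarity, so no extra normalization is needed inside the loss.
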
Year | Venue | Field
---|---|---|
2018 | British Machine Vision Conference | Ranking, Structured prediction, Image retrieval, Artificial intelligence, Mathematics, Machine learning

DocType | Citations | PageRank
---|---|---|
Conference | 21 | 0.64

References | Authors
---|---|
15 | 4

Name | Order | Citations | PageRank
---|---|---|---|
Fartash Faghri | 1 | 61 | 3.88 |
David J. Fleet | 2 | 21 | 1.65 |
Jamie Ryan Kiros | 3 | 21 | 0.64 |
Sanja Fidler | 4 | 183 | 10.30 |