Abstract |
---|
Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality, which misses sufficient information, or via finer-grained interactions using cross/self-attention upon visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this paper, we introduce large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only the contrastive loss, while simultaneously gaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks, including zero-shot image classification and image-text retrieval. Visualization of word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability. |
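
The token-wise maximum similarity mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy 2-dimensional token vectors, and the choice to average the per-token maxima are assumptions made for illustration. Each query token is matched to its most similar token on the other side, and the matches are averaged into one image-text score.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def late_interaction_similarity(query_tokens, key_tokens):
    """Token-wise maximum similarity (illustrative helper, hypothetical name).

    For every query token, take the highest cosine similarity against all
    key tokens, then average those maxima over the query tokens.
    """
    queries = [l2_normalize(t) for t in query_tokens]
    keys = [l2_normalize(t) for t in key_tokens]
    best_per_query = [
        max(sum(a * b for a, b in zip(q, k)) for k in keys)
        for q in queries
    ]
    return sum(best_per_query) / len(best_per_query)


# Toy token embeddings (made-up values, not real model outputs).
image_patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 2.0]]

# The score is direction-dependent: image-to-text and text-to-image differ.
sim_i2t = late_interaction_similarity(image_patches, text_tokens)
sim_t2i = late_interaction_similarity(text_tokens, image_patches)
print(sim_i2t, sim_t2i)
```

Because the score depends only on per-token embeddings computed separately for each modality, the image and text tokens can be encoded and stored ahead of time, which is the offline pre-computation property the abstract refers to. In a contrastive setup, scores of this kind would replace the single global-feature similarity for every image-text pair in a batch.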
Year | Venue | Keywords |
---|---|---|
2022 | International Conference on Learning Representations (ICLR) | Visual-language pretraining, Language-Image Pretraining, Multi-modality model |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors |
---|---|
0 | 10 |
Name | Order | Citations | PageRank |
---|---|---|---|
Lewei Yao | 1 | 0 | 1.35 |
Runhui Huang | 2 | 0 | 0.34 |
Lu Hou | 3 | 62 | 6.80 |
Guansong Lu | 4 | 15 | 1.95 |
Minzhe Niu | 5 | 0 | 1.69 |
Hang Xu | 6 | 7 | 9.91 |
Xiaodan Liang | 7 | 1096 | 77.53 |
Zhenguo Li | 8 | 581 | 41.17 |
Xin Jiang | 9 | 150 | 32.43 |
Chunjing Xu | 10 | 0 | 2.03 |