**Abstract**

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP's contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. Thus, the second component of ReCLIP is a spatial relation resolver that handles several types of spatial relations. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP's relative improvement over supervised ReC models trained on real images is 8%.
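
The region-scoring component described in the abstract lends itself to a compact sketch. The snippet below illustrates the crop-and-blur idea with off-the-shelf CLIP; it assumes OpenAI's `clip` package (github.com/openai/CLIP) and Pillow, and the helper names (`blur_outside_box`, `score_proposals`) and the simple averaging of the two views are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of crop-and-blur region scoring with CLIP.
# Assumes OpenAI's `clip` package and Pillow; helper names are hypothetical.
import clip
import torch
from PIL import Image, ImageFilter

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def blur_outside_box(image, box, radius=100):
    """Blur the full image, then paste the sharp proposal region back in."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius))
    blurred.paste(image.crop(box), box[:2])
    return blurred

@torch.no_grad()
def score_proposals(image, boxes, expression):
    """Score each proposal box against the referring expression with CLIP."""
    text_emb = model.encode_text(clip.tokenize([expression]).to(device))
    text_emb /= text_emb.norm(dim=-1, keepdim=True)
    scores = []
    for box in boxes:
        # Two views per proposal: a tight crop, and the full image with
        # everything outside the box blurred.
        views = [image.crop(box), blur_outside_box(image, box)]
        img_emb = model.encode_image(
            torch.stack([preprocess(v) for v in views]).to(device))
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
        # Average the cosine similarities of the two views.
        scores.append((img_emb @ text_emb.T).mean().item())
    return scores

image = Image.open("street.jpg")                   # hypothetical input image
boxes = [(10, 20, 200, 300), (150, 40, 400, 360)]  # (x1, y1, x2, y2) proposals
scores = score_proposals(image, boxes, "the man in the red jacket")
best_box = boxes[scores.index(max(scores))]
```

Because CLIP alone is largely incapable of spatial reasoning, a full ReCLIP pipeline would feed these per-box scores into the spatial relation resolver rather than taking the argmax directly.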
Year | DOI | Venue
---|---|---
2022 | 10.18653/v1/2022.acl-long.357 | Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022), Vol. 1: Long Papers

DocType | Volume | Citations
---|---|---
Conference | Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) | 1

PageRank | References | Authors
---|---|---
0.40 | 0 | 6

Name | Order | Citations | PageRank |
---|---|---|---|
Sanjay Subramanian | 1 | 1 | 3.78 |
Will Merrill | 2 | 1 | 0.40 |
Trevor Darrell | 3 | 22413 | 1800.67 |
Matthew Gardner | 4 | 704 | 38.49 |
Sameer Singh | 5 | 1060 | 71.63 |
Anna Rohrbach | 6 | 411 | 21.90 |