Abstract
---
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets, often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model trained on 12 datasets from four broad categories of task: visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform an in-depth analysis of the effects of jointly training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
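
The parameter reduction claimed above follows from sharing one large backbone across all tasks instead of training twelve full copies of it. The PyTorch sketch below illustrates that arithmetic only; the module sizes, head shapes, and names are illustrative assumptions, not the paper's actual ViLBERT-based architecture.

```python
# A minimal sketch (not the authors' implementation) of the parameter-sharing
# arithmetic behind multi-task training: twelve independent task models each
# carry their own copy of a large backbone, whereas one multi-task model
# shares a single backbone across twelve small task heads.
# All layer sizes below are illustrative placeholders.
import torch.nn as nn

BACKBONE_DIM, NUM_TASKS = 768, 12

def make_backbone() -> nn.Module:
    # Stand-in for a large shared vision-and-language encoder.
    return nn.Sequential(
        nn.Linear(BACKBONE_DIM, 4 * BACKBONE_DIM),
        nn.GELU(),
        nn.Linear(4 * BACKBONE_DIM, BACKBONE_DIM),
    )

def make_head(num_outputs: int = 2) -> nn.Module:
    # Small task-specific output layer (e.g., an answer classifier
    # or an image-caption matching score).
    return nn.Linear(BACKBONE_DIM, num_outputs)

def num_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

# Single-task baseline: every task trains its own backbone plus head.
single_task = [nn.ModuleList([make_backbone(), make_head()])
               for _ in range(NUM_TASKS)]
single_total = sum(num_params(m) for m in single_task)

# Multi-task model: one backbone shared by all task heads.
shared = nn.ModuleDict({
    "backbone": make_backbone(),
    "heads": nn.ModuleList([make_head() for _ in range(NUM_TASKS)]),
})
shared_total = num_params(shared)

print(f"12 single-task models: {single_total:,} parameters")
print(f"1 multi-task model:    {shared_total:,} parameters")
# Because each head is tiny relative to the backbone, the shared model is
# roughly NUM_TASKS times smaller, mirroring the paper's reported drop
# from ~3 billion parameters to ~270 million.
```
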
Year | DOI | Venue
---|---|---
2020 | 10.1109/CVPR42600.2020.01045 | CVPR

DocType | Citations | PageRank
---|---|---
Conference | 9 | 0.54

References | Authors
---|---
26 | 5

Name | Order | Citations | PageRank |
---|---|---|---
Jiasen Lu | 1 | 544 | 16.43 |
Goswami Vedanuj | 2 | 9 | 3.24 |
Marcus Rohrbach | 3 | 3138 | 107.83 |
Devi Parikh | 4 | 2929 | 132.01 |
Stefan Lee | 5 | 231 | 19.88 |