Title | ||
---|---|---|
Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training |
Abstract | ||
---|---|---|
In this work, we present Auto-captions on GIF (ACTION), a new large-scale pre-training dataset for generic video understanding. All video-sentence pairs are created by automatically extracting and filtering video caption annotations from billions of web pages. The Auto-captions on GIF dataset can be used to pre-train a generic feature representation or an encoder-decoder structure for video captioning and for other downstream tasks (e.g., sentence localization in videos, video question answering). We present a detailed analysis of the Auto-captions on GIF dataset in comparison to existing video-sentence datasets. We also evaluate a Transformer-based encoder-decoder structure for vision-language pre-training, which is further adapted to the video captioning downstream task and shows compelling generalizability on MSR-VTT. The dataset is available at http://www.auto-video-captions.top/2022/dataset. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1145/3503161.3551581 | International Multimedia Conference |
DocType | Citations | PageRank |
---|---|---|
Conference | 0 | 0.34 |
References | Authors | |
---|---|---|
0 | 6 | |
Name | Order | Citations | PageRank |
---|---|---|---|
Yingwei Pan | 1 | 357 | 23.66 |
Yehao Li | 2 | 75 | 8.57 |
Jianjie Luo | 3 | 0 | 0.34 |
Jun Xu | 4 | 72 | 2.20 |
Ting Yao | 5 | 842 | 52.62 |
Tao Mei | 6 | 4702 | 288.54 |