Abstract
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. In this paper, we propose Align and Prompt: a new video-and-language pre-training framework (ALPRO), which operates on sparsely-sampled video frames and achieves more effective cross-modal alignment without explicit object detectors. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a novel visually-grounded pre-training task, prompting entity modeling (PEM), which learns fine-grained alignment between visual regions and text entities via an entity prompter module in a self-supervised way. Finally, we pre-train the video-and-language transformer models on large webly-sourced video-text pairs using the proposed VTC and PEM losses as well as two standard losses of masked language modeling (MLM) and video-text matching (VTM). The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Implementation and pre-trained models are available at https://github.com/salesforce/ALPRO.
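The VTC loss described above aligns paired video and text embeddings at the instance level before they enter the multimodal encoder. A common instantiation of such an objective is a symmetric InfoNCE-style contrastive loss over in-batch pairs; the sketch below is illustrative (function and parameter names are our own, and the paper's actual formulation may differ in details such as similarity projection heads or soft labels):

```python
import numpy as np

def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired embeddings.

    video_feats, text_feats: (B, D) arrays where row i of each is a matched pair.
    Matched pairs are treated as positives; all other in-batch pairs as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B): logits[i, j] = sim(video_i, text_j)

    def log_softmax(x, axis):
        # numerically stable log-softmax
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(v.shape[0])
    # video-to-text: softmax over texts (rows); text-to-video: over videos (columns)
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls each matched video-text pair together while pushing apart mismatched in-batch pairs, which is what eases the subsequent cross-modal fusion.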
| Year | DOI | Venue |
|---|---|---|
| 2022 | 10.1109/CVPR52688.2022.00490 | IEEE Conference on Computer Vision and Pattern Recognition |

| Keywords | DocType | Volume |
|---|---|---|
| Vision + language, Video analysis and understanding | Conference | 2022 |

| Issue | Citations | PageRank |
|---|---|---|
| 1 | 0 | 0.34 |

| References | Authors |
|---|---|
| 0 | 5 |
| Name | Order | Citations | PageRank |
|---|---|---|---|
| Dongxu Li | 1 | 4 | 3.77 |
| Junnan Li | 2 | 58 | 10.46 |
| Hongdong Li | 3 | 1724 | 101.81 |
| Juan Carlos Niebles | 4 | 0 | 0.68 |
| Steven C. H. Hoi | 5 | 3830 | 174.61 |