Title
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
Abstract
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. In this paper, we propose Align and Prompt: a new video-and-language pre-training framework (AlPro), which operates on sparsely-sampled video frames and achieves more effective cross-modal alignment without explicit object detectors. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a novel visually-grounded pre-training task, prompting entity modeling (PEM), which learns fine-grained alignment between visual regions and text entities via an entity prompter module in a self-supervised way. Finally, we pre-train the video-and-language transformer models on large webly-sourced video-text pairs using the proposed VTC and PEM losses as well as two standard losses of masked language modeling (MLM) and video-text matching (VTM). The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Implementation and pre-trained models are available at https://github.com/salesforce/ALPRO.
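The instance-level alignment that the abstract attributes to the VTC loss can be illustrated with a generic symmetric contrastive (InfoNCE-style) objective. The sketch below is not taken from the ALPRO repository: the function names, the temperature value of 0.07, and the averaging of the two directions are all assumptions made for illustration. It only shows the general idea that matched video-text pairs in a batch are pulled together while mismatched pairs are pushed apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(logits, index):
    """Log-softmax of `logits` evaluated at `index`, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return logits[index] - log_sum


def contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch (hypothetical helper).

    Assumes video_embs[i] and text_embs[i] form the only positive pair for
    index i; every other combination in the batch is treated as a negative.
    The temperature default is an assumed value, not one from the paper.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(videos)

    # Temperature-scaled cosine similarity between every video and every text.
    sim = [
        [sum(a * b for a, b in zip(videos[i], texts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Video-to-text: each row should score its own text highest.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video: each column should score its own video highest.
    t2v = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n

    # Averaging the two directions is an assumption of this sketch.
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.0, 1.0]]
    print("aligned pairs:   ", contrastive_loss(feats, feats))
    print("mismatched pairs:", contrastive_loss(feats, feats[::-1]))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior an instance-level alignment objective is meant to reward.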
Year: 2022
DOI: 10.1109/CVPR52688.2022.00490
Venue: IEEE Conference on Computer Vision and Pattern Recognition
Keywords: Vision + language, Video analysis and understanding
DocType: Conference
Volume: 2022
Issue: 1
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name                 Order  Citations  PageRank
Dongxu Li            1      4          3.77
Junnan Li            2      58         10.46
Hongdong Li          3      1724       101.81
Juan Carlos Niebles  4      0          0.68
Steven C. H. Hoi     5      3830       174.61