Abstract
Video-and-language pre-training has shown promising improvements on various downstream tasks. Most previous methods capture cross-modal interactions with a standard transformer-based multimodal encoder, not fully addressing the misalignment between unimodal video and text features. Besides, learning fine-grained visual-language alignment usually requires off-the-shelf object detectors to provide object information, which is bottlenecked by the detector's limited vocabulary and expensive computation cost. In this paper, we propose Align and Prompt: a new video-and-language pre-training framework (ALPRO), which operates on sparsely-sampled video frames and achieves more effective cross-modal alignment without explicit object detectors. First, we introduce a video-text contrastive (VTC) loss to align unimodal video-text features at the instance level, which eases the modeling of cross-modal interactions. Then, we propose a novel visually-grounded pre-training task, prompting entity modeling (PEM), which learns fine-grained alignment between visual regions and text entities via an entity prompter module in a self-supervised way. Finally, we pre-train the video-and-language transformer models on large webly-sourced video-text pairs using the proposed VTC and PEM losses as well as two standard losses of masked language modeling (MLM) and video-text matching (VTM). The resulting pre-trained model achieves state-of-the-art performance on both text-video retrieval and videoQA, outperforming prior work by a substantial margin. Implementation and pre-trained models are available at https://github.com/salesforce/ALPRO.
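The VTC loss described above aligns paired video and text embeddings at the instance level before they enter the multimodal encoder. A common instantiation of such an objective is a symmetric InfoNCE-style contrastive loss over in-batch pairs; the sketch below is illustrative (function and parameter names are our own, and the paper's actual formulation may differ in details such as similarity projection heads or soft labels):

```python
import numpy as np

def video_text_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over a batch of paired embeddings.

    video_feats, text_feats: (B, D) arrays where row i of each is a matched pair.
    Matched pairs are treated as positives; all other in-batch pairs as negatives.
    """
    # L2-normalize so dot products become cosine similarities
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B): logits[i, j] = sim(video_i, text_j)

    def log_softmax(x, axis):
        # numerically stable log-softmax
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(v.shape[0])
    # video-to-text: softmax over texts (rows); text-to-video: over videos (columns)
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_v2t + loss_t2v)
```

Minimizing this loss pulls each matched video-text pair together while pushing apart mismatched in-batch pairs, which is what eases the subsequent cross-modal fusion.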
| Year | DOI | Venue |
|---|---|---|
| 2022 | 10.1109/CVPR52688.2022.00490 | IEEE Conference on Computer Vision and Pattern Recognition |

| Keywords | DocType | Volume |
|---|---|---|
| Vision + language, Video analysis and understanding | Conference | 2022 |

| Issue | Citations | PageRank |
|---|---|---|
| 1 | 0 | 0.34 |

| References | Authors |
|---|---|
| 0 | 5 |
| Name | Order | Citations | PageRank |
|---|---|---|---|
| Dongxu Li | 1 | 4 | 3.77 |
| Junnan Li | 2 | 58 | 10.46 |
| Hongdong Li | 3 | 1724 | 101.81 |
| Juan Carlos Niebles | 4 | 0 | 0.68 |
| Steven C. H. Hoi | 5 | 3830 | 174.61 |