Title: Coarse-to-fine dual-level attention for video-text cross-modal retrieval
Abstract: Effective representation of video features plays an important role in video-text cross-modal retrieval, yet many researchers either use a single modality of video features or simply combine multi-modal video features, which makes the learned video representation less robust. To enhance robustness, we use a coarse-fine-grained parallel attention model together with a feature fusion module to learn more effective video representations. Coarse-grained attention learns the relationships between different feature blocks within the same modality, while fine-grained attention operates on global features and strengthens the connections between individual points; the two levels of attention complement each other. We integrate a multi-head attention network into the model to expand the receptive field of the features, and use the feature fusion module to further reduce the semantic gap between different video modalities. The proposed architecture not only strengthens the relationship between global and local features, but also compensates for the differences between the video's modality features. Evaluations on three widely used datasets, ActivityNet-Captions, MSRVTT, and LSMDC, demonstrate its effectiveness.
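The abstract describes two attention branches run in parallel (coarse-grained over feature blocks, fine-grained over all frame-level features) whose outputs are fused into one video embedding. Below is a minimal PyTorch sketch of how such a dual-level design might be wired up; the class name DualLevelAttention, the mean-pooling into segments, the fusion MLP, and all dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DualLevelAttention(nn.Module):
    """Sketch of parallel coarse- and fine-grained attention with fusion.

    All names and shapes here are assumptions for illustration only.
    """
    def __init__(self, dim=512, heads=8, num_blocks=4):
        super().__init__()
        self.num_blocks = num_blocks
        # Coarse branch: self-attention over pooled feature blocks (segments).
        self.coarse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fine branch: self-attention over every frame-level feature.
        self.fine_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fusion module: merge the two complementary views into one embedding.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x):  # x: (B, T, dim) frame-level video features
        B, T, D = x.shape
        # Coarse: mean-pool frames into num_blocks segments, then attend
        # across segments to relate feature blocks of the same modality.
        blocks = x.reshape(B, self.num_blocks, T // self.num_blocks, D).mean(2)
        coarse, _ = self.coarse_attn(blocks, blocks, blocks)  # (B, K, D)
        coarse = coarse.mean(1)                               # (B, D)
        # Fine: attend over all frames to strengthen point-to-point links.
        fine, _ = self.fine_attn(x, x, x)                     # (B, T, D)
        fine = fine.mean(1)                                   # (B, D)
        # Fuse both granularities into a single video representation.
        return self.fuse(torch.cat([coarse, fine], dim=-1))   # (B, D)

# Usage: a batch of 8 clips, 32 frames each, 512-d features per frame.
video = torch.randn(8, 32, 512)
emb = DualLevelAttention()(video)
print(emb.shape)  # torch.Size([8, 512])
```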
Year: 2022
DOI: 10.1016/j.knosys.2022.108354
Venue: Knowledge-Based Systems
Keywords: Video vs. text cross-modal retrieval, Coarse-fine-grained parallel attention, Multi-head attention, Feature fusion
DocType: Journal
Volume: 242
ISSN: 0950-7051
Citations: 0
PageRank: 0.34
References: 24
Authors: 5
Name            Order  Citations  PageRank
Ming Jin        1      0          0.34
Huaxiang Zhang  2      436        56.32
Lei Zhu         3      854        51.69
Jiande Sun      4      232        41.76
Li Liu          5      1264       61.72