Title: Coarse-to-fine dual-level attention for video-text cross-modal retrieval
Abstract: Effective representation of video features plays an important role in video-text cross-modal retrieval, yet many researchers either use a single modality of video features or simply combine multi-modal video features, which makes the learned video representation less robust. To enhance robustness, we use a coarse-fine-grained parallel attention model together with a feature fusion module to learn more effective video representations. Coarse-grained attention learns the relationships between different feature blocks within the same modality, while fine-grained attention operates on global features and strengthens the connections between individual points; the two levels of attention complement each other. We integrate a multi-head attention network into the model to expand the receptive field of the features, and use the feature fusion module to further reduce the semantic gap between different video modalities. The proposed architecture not only strengthens the relationship between global and local features, but also compensates for the differences between the video's modality features. Evaluations on three widely used datasets, ActivityNet-Captions, MSRVTT, and LSMDC, demonstrate its effectiveness.
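The abstract describes two attention branches run in parallel (coarse-grained over feature blocks, fine-grained over all frame-level features) whose outputs are fused into one video embedding. Below is a minimal PyTorch sketch of how such a dual-level design might be wired up; the class name DualLevelAttention, the mean-pooling into segments, the fusion MLP, and all dimensions are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DualLevelAttention(nn.Module):
    """Sketch of parallel coarse- and fine-grained attention with fusion.

    All names and shapes here are assumptions for illustration only.
    """
    def __init__(self, dim=512, heads=8, num_blocks=4):
        super().__init__()
        self.num_blocks = num_blocks
        # Coarse branch: self-attention over pooled feature blocks (segments).
        self.coarse_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fine branch: self-attention over every frame-level feature.
        self.fine_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Fusion module: merge the two complementary views into one embedding.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, x):  # x: (B, T, dim) frame-level video features
        B, T, D = x.shape
        # Coarse: mean-pool frames into num_blocks segments, then attend
        # across segments to relate feature blocks of the same modality.
        blocks = x.reshape(B, self.num_blocks, T // self.num_blocks, D).mean(2)
        coarse, _ = self.coarse_attn(blocks, blocks, blocks)  # (B, K, D)
        coarse = coarse.mean(1)                               # (B, D)
        # Fine: attend over all frames to strengthen point-to-point links.
        fine, _ = self.fine_attn(x, x, x)                     # (B, T, D)
        fine = fine.mean(1)                                   # (B, D)
        # Fuse both granularities into a single video representation.
        return self.fuse(torch.cat([coarse, fine], dim=-1))   # (B, D)

# Usage: a batch of 8 clips, 32 frames each, 512-d features per frame.
video = torch.randn(8, 32, 512)
emb = DualLevelAttention()(video)
print(emb.shape)  # torch.Size([8, 512])
```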
Year: 2022
DOI: 10.1016/j.knosys.2022.108354
Venue: Knowledge-Based Systems
Keywords: Video vs. text cross-modal retrieval, Coarse-fine-grained parallel attention, Multi-head attention, Feature fusion
DocType: Journal
Volume: 242
ISSN: 0950-7051
Citations: 0
PageRank: 0.34
References: 24
Authors: 5
Name            Order  Citations  PageRank
Ming Jin        1      0          0.34
Huaxiang Zhang  2      436        56.32
Lei Zhu         3      854        51.69
Jiande Sun      4      232        41.76
Li Liu          5      1264       61.72