Abstract |
---|
Cross-modal retrieval is a challenging and significant task in intelligent understanding. Researchers have tried to capture modal semantic information through weighted attention mechanisms, but these methods neither eliminate the negative effects of irrelevant semantic information nor capture fine-grained modal semantic information. To capture multi-modal semantic information more accurately, a bidirectional focused semantic alignment attention network (BFSAAN) is proposed to handle cross-modal retrieval tasks. The core ideas of BFSAAN are as follows: 1) A bidirectional focused attention mechanism is adopted to share modal semantic information, further eliminating the negative influence of irrelevant semantic information. 2) Strip pooling, a lightweight spatial attention mechanism, is applied to the image and text modalities to capture modal spatial semantic information. 3) Second-order covariance pooling is explored to obtain the multi-modal semantic representation, capturing modal channel semantic information and achieving semantic alignment between the image and text modalities. Experiments are conducted on two standard cross-modal retrieval datasets (Flickr30K and MS COCO). The experimental design covers four aspects: performance comparison, ablation analysis, algorithm convergence, and visual analysis. The results show that BFSAAN achieves better cross-modal retrieval performance. |
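
The abstract names strip pooling (a lightweight spatial attention) and second-order covariance pooling (a channel-level representation) as building blocks. The following is a minimal numpy sketch of the generic forms of these two operations, not the paper's exact formulation; the function names, fusion scheme, and feature shapes are illustrative assumptions.

```python
import numpy as np

def strip_pooling_attention(x):
    """Generic strip-pooling spatial attention (sketch, not BFSAAN's exact design).

    x: feature map of shape (C, H, W).
    Averages over horizontal and vertical strips, fuses the two 1-D context
    descriptors by broadcasting, and reweights the map with a sigmoid gate.
    """
    h_pool = x.mean(axis=2, keepdims=True)  # horizontal strips -> (C, H, 1)
    w_pool = x.mean(axis=1, keepdims=True)  # vertical strips   -> (C, 1, W)
    fused = h_pool + w_pool                 # broadcast back to (C, H, W)
    attn = 1.0 / (1.0 + np.exp(-fused))     # sigmoid gate in (0, 1)
    return x * attn

def covariance_pooling(x):
    """Second-order (covariance) pooling over spatial positions (sketch).

    x: feature map of shape (C, H, W).
    Returns the C x C channel covariance matrix, a second-order
    representation that captures channel co-activation statistics.
    """
    c, h, w = x.shape
    f = x.reshape(c, h * w)                 # treat each position as a sample
    f = f - f.mean(axis=1, keepdims=True)   # center per channel
    return f @ f.T / (h * w - 1)            # (C, C) covariance matrix
```

In this generic form, strip pooling keeps attention cheap (two 1-D averages instead of a dense 2-D map), while covariance pooling summarizes the whole feature map as channel-pair statistics rather than first-order means.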
Year | DOI | Venue |
---|---|---|
2021 | 10.1109/ICASSP39728.2021.9414382 | 2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021) |

Keywords | DocType | Citations
---|---|---
Cross-modal retrieval, bidirectional focused attention, semantic alignment, attention mechanism | Conference | 0

PageRank | References | Authors
---|---|---
0.34 | 0 | 4

Name | Order | Citations | PageRank
---|---|---|---
Shuli Cheng | 1 | 6 | 7.59 |
Liejun Wang | 2 | 7 | 2.86 |
Anyu Du | 3 | 4 | 4.19 |
Yongming Li | 4 | 0 | 0.34 |