Abstract
---

Description-based person search aims to retrieve images of a person from an image database based on a textual description of that person. It is a challenging task because visual images and textual descriptions belong to different modalities. To fully capture the relevance between person images and textual descriptions, we propose a multi-grained framework with three branches for visual-textual matching. Specifically, in the global-grained branch, we extract global contexts from the entire images and descriptions. In the fine-grained branch, we adopt visual human parsing and linguistic parsing to split images and descriptions into semantic components related to different body parts. We design two attention mechanisms, segmentation-based attention and linguistics-based attention, to align visual and textual semantic components for fine-grained matching. To further exploit the spatial relations between fine-grained semantic components, we construct a body graph in the coarse-grained branch and apply graph convolutional neural networks to aggregate fine-grained components into coarse-grained representations. The visual and textual representations learned by the three branches are complementary to each other, which enhances visual-textual matching performance. Experimental results on the CUHK-PEDES dataset show that our approach performs favorably against state-of-the-art description-based person search methods.

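The coarse-grained branch described above aggregates body-part features over a body graph with graph convolutions. As a rough illustration of that aggregation step only, here is a minimal PyTorch sketch; the part layout, adjacency matrix, feature dimension, pooling choice, and all names (`BodyGraphConv`, the 5-part toy graph) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BodyGraphConv(nn.Module):
    """One graph convolution over K body-part nodes (hypothetical layer)."""
    def __init__(self, dim, adj):
        super().__init__()
        # Symmetrically normalized adjacency with self-loops:
        # D^{-1/2} (A + I) D^{-1/2}
        a = adj + torch.eye(adj.size(0))
        d = a.sum(dim=1).rsqrt()
        self.register_buffer("a_norm", d.unsqueeze(1) * a * d.unsqueeze(0))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, K, dim) fine-grained part features
        # Propagate features between connected body parts, then transform.
        return F.relu(self.proj(self.a_norm @ x))

# Toy body graph over 5 assumed parts: head, torso, two arms, legs,
# with the torso connected to every other part.
adj = torch.tensor([[0, 1, 0, 0, 0],
                    [1, 0, 1, 1, 1],
                    [0, 1, 0, 0, 0],
                    [0, 1, 0, 0, 0],
                    [0, 1, 0, 0, 0]], dtype=torch.float32)

gcn = BodyGraphConv(dim=256, adj=adj)
visual_parts = torch.randn(8, 5, 256)    # batch of 8, 5 parts, 256-d features
textual_parts = torch.randn(8, 5, 256)   # sharing one GCN here is an assumption

v_coarse = gcn(visual_parts).mean(dim=1)   # pool nodes -> coarse representation
t_coarse = gcn(textual_parts).mean(dim=1)
score = F.cosine_similarity(v_coarse, t_coarse, dim=-1)  # (8,) matching scores
# In the full framework, this coarse-grained score would be combined with the
# global-grained and fine-grained branch scores.
```

Mean-pooling the graph output into a single vector and comparing with cosine similarity are simple placeholder choices; the paper's actual pooling, weight sharing, and similarity functions may differ.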
Field | Value
---|---
Year | 2021
DOI | 10.1016/j.displa.2021.102039
Venue | DISPLAYS
Keywords | Description-based person search, Visual-textual matching, Cross-modal matching, Attention mechanism, Multi-grained matching networks
DocType | Journal
Volume | 69
ISSN | 0141-9382
Citations | 0
PageRank | 0.34
References | 0
Authors | 4

Name | Order | Citations | PageRank |
---|---|---|---
Ji Zhu | 1 | 0 | 0.34 |
Hua Yang | 2 | 2 | 3.43 |
Jia Wang | 3 | 0 | 0.34 |
Wenjun Zhang | 4 | 1789 | 177.28 |