Abstract
In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or bidirectional relationships between visual and linguistic features to associate the two modalities, while multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial for localizing and segmenting objects with large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multi-scale contextual information and facilitate the alignment of the two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
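The abstract describes a fusion module built from parallel atrous (dilated) convolutions plus a cascaded branch that progressively accumulates multi-scale context. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the class name `CascadedMultiModalFusion`, the dilation rates, the `fused_dim` size, the tile-and-concatenate fusion of the sentence embedding, and the ReLU activations are all assumptions for illustration; the released code at the repository above is authoritative.

```python
import torch
import torch.nn as nn


class CascadedMultiModalFusion(nn.Module):
    """Hypothetical sketch of a CMF-style module: parallel atrous convolutions
    over fused visual-linguistic features, with a cascaded branch that
    progressively integrates the multi-scale context. Layer sizes and the
    fusion operator are assumptions, not the paper's exact design."""

    def __init__(self, vis_dim, lang_dim, fused_dim, dilations=(1, 3, 5, 7)):
        super().__init__()
        # Project both modalities to a common channel dimension.
        self.vis_proj = nn.Conv2d(vis_dim, fused_dim, kernel_size=1)
        self.lang_proj = nn.Conv2d(lang_dim, fused_dim, kernel_size=1)
        # Parallel atrous convolutions capture context at several scales.
        self.atrous = nn.ModuleList([
            nn.Conv2d(2 * fused_dim, fused_dim, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        ])
        # Cascaded branch: one refinement step per scale.
        self.cascade = nn.ModuleList([
            nn.Conv2d(fused_dim, fused_dim, kernel_size=3, padding=1)
            for _ in dilations
        ])
        self.out = nn.Conv2d(fused_dim, fused_dim, kernel_size=1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, vis_dim, H, W); lang_feat: (B, lang_dim) sentence embedding.
        b, _, h, w = vis_feat.shape
        # Tile the sentence embedding over the spatial grid and fuse by concatenation.
        lang = lang_feat.view(b, -1, 1, 1).expand(-1, -1, h, w)
        fused = torch.cat([self.vis_proj(vis_feat), self.lang_proj(lang)], dim=1)

        context = 0
        for atrous_conv, cascade_conv in zip(self.atrous, self.cascade):
            branch = torch.relu(atrous_conv(fused))
            # Progressively integrate each new scale into the running context.
            context = torch.relu(cascade_conv(branch + context))
        return self.out(context)


# Example usage with placeholder feature sizes (assumed, not from the paper):
# cmf = CascadedMultiModalFusion(vis_dim=2048, lang_dim=1024, fused_dim=512)
# out = cmf(torch.randn(2, 2048, 26, 26), torch.randn(2, 1024))  # (2, 512, 26, 26)
```

The design intent mirrored here is that each cascaded step sees both a new dilation rate and the context accumulated from smaller rates, so objects of very different sizes can be aligned with the expression in a single fusion pass.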
Year | DOI | Venue
---|---|---
2021 | 10.1109/ICIP42928.2021.9506483 | ICIP

DocType | Citations | PageRank
---|---|---
Conference | 0 | 0.34

References | Authors
---|---
0 | 4

Name | Order | Citations | PageRank |
---|---|---|---
Jianhua Yang | 1 | 80 | 12.95 |
Yan Huang | 2 | 226 | 27.65 |
Zhanyu Ma | 3 | 539 | 55.74 |
Liang Wang | 4 | 4317 | 243.28 |