Abstract
---

Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer that fuses and aggregates multi-modal and temporal features across frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features at each feature level under the guidance of linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and generalization ability of our method compared with state-of-the-art methods.
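The abstract does not specify the fusion mechanism in detail. As a rough, hypothetical sketch of what a language-guided fusion step could look like (all shapes and projections are illustrative assumptions, not the paper's actual design), the sentence embedding can be projected into per-channel gates that reweight the appearance and optical-flow feature maps before they are combined:

```python
import numpy as np

def language_guided_fusion(appearance, motion, language):
    """Illustrative sketch: project the sentence embedding into
    per-channel sigmoid gates, reweight the appearance and motion
    feature maps channel-wise, then fuse them by addition."""
    c = appearance.shape[0]
    rng = np.random.default_rng(0)
    # Hypothetical learned projections (random here for illustration).
    w_app = rng.standard_normal((c, language.shape[0]))
    w_mot = rng.standard_normal((c, language.shape[0]))
    # Sigmoid gates conditioned on the linguistic feature, shape (C,).
    gate_app = 1.0 / (1.0 + np.exp(-w_app @ language))
    gate_mot = 1.0 / (1.0 + np.exp(-w_mot @ language))
    # Channel-wise reweighting, then fusion by element-wise addition.
    return gate_app[:, None, None] * appearance + gate_mot[:, None, None] * motion

appearance = np.ones((8, 4, 4))   # (C, H, W) appearance features
motion = np.ones((8, 4, 4))       # (C, H, W) optical-flow features
language = np.ones(16)            # pooled sentence embedding
fused = language_guided_fusion(appearance, motion, language)
print(fused.shape)  # (8, 4, 4)
```

Since the gates lie in (0, 1), each fused value here is a convex-style blend bounded by the sum of the two modality features; the actual module in the paper operates at multiple feature levels.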
| Field | Value |
|---|---|
| Year | 2022 |
| DOI | 10.1109/CVPR52688.2022.01144 |
| Venue | IEEE Conference on Computer Vision and Pattern Recognition |
| Keywords | Segmentation, grouping and shape analysis; Video analysis and understanding; Vision + language |
| DocType | Conference |
| Volume | 2022 |
| Issue | 1 |
| Citations | 0 |
| PageRank | 0.34 |
| References | 0 |
| Authors | 6 |
| Name | Order | Citations | PageRank |
|---|---|---|---|
| Wangbo Zhao | 1 | 0 | 0.34 |
| Kai Wang | 2 | 1734 | 195.03 |
| Xiangxiang Chu | 3 | 2 | 1.39 |
| Fuzhao Xue | 4 | 0 | 0.34 |
| Xinchao Wang | 5 | 474 | 43.70 |
| Yang You | 6 | 0 | 0.34 |