Abstract
---

Text-based video segmentation aims to segment the target object in a video based on a describing sentence. Incorporating motion information from optical flow maps with appearance and linguistic modalities is crucial yet has been largely ignored by previous work. In this paper, we design a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation. Specifically, we propose a multi-modal video transformer that fuses and aggregates multi-modal and temporal features across frames. Furthermore, we design a language-guided feature fusion module to progressively fuse appearance and motion features at each feature level under the guidance of linguistic features. Finally, a multi-modal alignment loss is proposed to alleviate the semantic gap between features from different modalities. Extensive experiments on A2D Sentences and J-HMDB Sentences verify the performance and generalization ability of our method compared with state-of-the-art methods.
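The abstract does not specify the fusion mechanism in detail. As a rough, hypothetical sketch of what a language-guided fusion step could look like (all shapes and projections are illustrative assumptions, not the paper's actual design), the sentence embedding can be projected into per-channel gates that reweight the appearance and optical-flow feature maps before they are combined:

```python
import numpy as np

def language_guided_fusion(appearance, motion, language):
    """Illustrative sketch: project the sentence embedding into
    per-channel sigmoid gates, reweight the appearance and motion
    feature maps channel-wise, then fuse them by addition."""
    c = appearance.shape[0]
    rng = np.random.default_rng(0)
    # Hypothetical learned projections (random here for illustration).
    w_app = rng.standard_normal((c, language.shape[0]))
    w_mot = rng.standard_normal((c, language.shape[0]))
    # Sigmoid gates conditioned on the linguistic feature, shape (C,).
    gate_app = 1.0 / (1.0 + np.exp(-w_app @ language))
    gate_mot = 1.0 / (1.0 + np.exp(-w_mot @ language))
    # Channel-wise reweighting, then fusion by element-wise addition.
    return gate_app[:, None, None] * appearance + gate_mot[:, None, None] * motion

appearance = np.ones((8, 4, 4))   # (C, H, W) appearance features
motion = np.ones((8, 4, 4))       # (C, H, W) optical-flow features
language = np.ones(16)            # pooled sentence embedding
fused = language_guided_fusion(appearance, motion, language)
print(fused.shape)  # (8, 4, 4)
```

Since the gates lie in (0, 1), each fused value here is a convex-style blend bounded by the sum of the two modality features; the actual module in the paper operates at multiple feature levels.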
| Field | Value |
|---|---|
| Year | 2022 |
| DOI | 10.1109/CVPR52688.2022.01144 |
| Venue | IEEE Conference on Computer Vision and Pattern Recognition |
| Keywords | Segmentation, grouping and shape analysis; Video analysis and understanding; Vision + language |
| DocType | Conference |
| Volume | 2022 |
| Issue | 1 |
| Citations | 0 |
| PageRank | 0.34 |
| References | 0 |
| Authors | 6 |
| Name | Order | Citations | PageRank |
|---|---|---|---|
| Wangbo Zhao | 1 | 0 | 0.34 |
| Kai Wang | 2 | 1734 | 195.03 |
| Xiangxiang Chu | 3 | 2 | 1.39 |
| Fuzhao Xue | 4 | 0 | 0.34 |
| Xinchao Wang | 5 | 474 | 43.70 |
| Yang You | 6 | 0 | 0.34 |