Object-Agnostic Transformers for Video Referring Segmentation - Citegraph

Paper Info

Title
Object-Agnostic Transformers for Video Referring Segmentation

Abstract
Video referring segmentation focuses on segmenting out the object in a video based on the corresponding textual description. Previous works have primarily tackled this task by devising two crucial parts, an intra-modal module for context modeling and an inter-modal module for heterogeneous alignment. However, there are two essential drawbacks of this approach: (1) it lacks joint learning of context modeling and heterogeneous alignment, leading to insufficient interactions among input elements; (2) both modules require task-specific expert knowledge to design, which severely limits the flexibility and generality of prior methods. To address these problems, we here propose a novel Object-Agnostic Transformer-based Network, called OATNet, that simultaneously conducts intra-modal and inter-modal learning for video referring segmentation, without the aid of object detection or category-specific pixel labeling. More specifically, we first directly feed the sequence of textual tokens and visual tokens (pixels rather than detected object bounding boxes) into a multi-modal encoder, where context and alignment are simultaneously and effectively explored. We then design a novel cascade segmentation network to decouple our task into coarse-grained segmentation and fine-grained refinement. Moreover, considering the difficulty of samples, a more balanced metric is provided to better diagnose the performance of the proposed method. Extensive experiments on two popular datasets, A2D Sentences and J-HMDB Sentences, demonstrate that our proposed approach noticeably outperforms state-of-the-art methods.
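
The abstract's central idea, feeding textual tokens and pixel-level visual tokens into one encoder so that context modeling and cross-modal alignment happen in the same attention step, can be illustrated with a toy example. The sketch below is not OATNet's architecture: it assumes a single attention head, identity query/key/value projections, and made-up 2-dimensional token features, and the name `joint_self_attention` is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(text_tokens, visual_tokens):
    """Single-head self-attention over the concatenated token sequence.

    Assumption: identity projections (Q = K = V = tokens). A real encoder
    would use learned projections, multiple heads, and stacked layers.
    """
    tokens = text_tokens + visual_tokens  # one shared sequence
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    outputs, weights_all = [], []
    for q in tokens:
        # Each token scores every token, regardless of modality.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        out = [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        outputs.append(out)
        weights_all.append(weights)
    return outputs, weights_all


# Hypothetical features: 2 word tokens and 3 pixel tokens.
text = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.5]]
fused, attn = joint_self_attention(text, pixels)
```

Each row of `attn` spans all five tokens, so a word token attends to other words (intra-modal) and to pixels (inter-modal) in the same step, which is the joint learning the abstract contrasts with separate intra-modal and inter-modal modules.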

Field	Value
Year	2022
DOI	10.1109/TIP.2022.3161832
Venue	IEEE TRANSACTIONS ON IMAGE PROCESSING
Keywords	Task analysis, Visualization, Transformers, Feature extraction, Object detection, Image segmentation, Context modeling, Video referring segmentation, multi-modal learning, video grounding
DocType	Journal
Volume	31
Issue	1
ISSN	1057-7149
Citations	0
PageRank	0.34
References	9
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Xu Yang	1	45	8.16
Hao Wang	2	18	4.34
De Xie	3	9	1.84
Cheng Deng	4	1283	85.48
Dacheng Tao	5	19032	747.78
