Title
Video Semantic Segmentation via Sparse Temporal Transformer
Abstract
Currently, video semantic segmentation mainly faces two challenges: 1) the demand for temporal consistency; 2) the balance between segmentation accuracy and inference efficiency. For the first challenge, existing methods usually use optical flow to capture the temporal relation across consecutive frames and maintain temporal consistency, but the low inference speed of optical flow estimation limits real-time applications. For the second challenge, flow-based key-frame warping is one mainstream solution; however, its unbalanced inference latency makes it unsatisfactory for real-time applications. Considering both segmentation accuracy and inference efficiency, we propose a novel Sparse Temporal Transformer (STT) to bridge temporal relations among video frames adaptively, which is equipped with query selection and key selection. The key selection and query selection strategies are applied separately to filter out temporal and spatial redundancy in our temporal transformer. Specifically, our STT reduces the time complexity of the temporal transformer by a large margin without harming segmentation accuracy or temporal consistency. Experiments on two benchmark datasets, Cityscapes and CamVid, demonstrate that our method achieves state-of-the-art segmentation accuracy and temporal consistency with comparable inference speed.
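The abstract describes the STT only at a high level. As an illustration of the general mechanism it refers to, the sketch below shows cross-frame attention with top-k key selection (dropping temporally redundant reference-frame positions) and top-k query selection (updating only the current-frame positions predicted to need temporal context), which reduces the attention cost from O(N^2) to O(num_queries * num_keys). This is a minimal sketch under assumptions, not the authors' implementation; the module name SparseTemporalAttention, the learned saliency heads, and the top-k budgets are all illustrative choices.

# Minimal sketch (not the authors' released code) of cross-frame attention
# with top-k query and key selection, i.e. the kind of sparse temporal
# attention the abstract refers to. All names and budgets are assumptions.
import torch
import torch.nn as nn


class SparseTemporalAttention(nn.Module):
    """Attend from selected current-frame queries to selected
    reference-frame keys/values, cutting the quadratic cost of full
    spatio-temporal attention to O(num_queries * num_keys)."""

    def __init__(self, dim, num_queries=1024, num_keys=1024):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Hypothetical learned saliency heads driving the two selections.
        self.q_score = nn.Linear(dim, 1)
        self.k_score = nn.Linear(dim, 1)
        self.num_queries = num_queries
        self.num_keys = num_keys
        self.scale = dim ** -0.5

    def forward(self, cur_feat, ref_feat):
        # cur_feat, ref_feat: (B, N, C) flattened spatial features of the
        # current frame and of a reference (e.g. previous) frame.
        _, N, C = cur_feat.shape
        q = self.q_proj(cur_feat)
        k = self.k_proj(ref_feat)
        v = self.v_proj(ref_feat)
        b = torch.arange(cur_feat.size(0), device=cur_feat.device).unsqueeze(-1)

        # Query selection: keep only the current-frame positions predicted
        # to benefit from temporal context (filters spatial redundancy).
        nq = min(self.num_queries, N)
        q_idx = self.q_score(cur_feat).squeeze(-1).topk(nq, dim=1).indices
        q_sel = q[b, q_idx]                      # (B, nq, C)

        # Key selection: keep only the most informative reference-frame
        # positions (filters temporal redundancy).
        nk = min(self.num_keys, ref_feat.shape[1])
        k_idx = self.k_score(ref_feat).squeeze(-1).topk(nk, dim=1).indices
        k_sel, v_sel = k[b, k_idx], v[b, k_idx]  # (B, nk, C)

        # Sparse attention over the selected tokens only.
        attn = torch.softmax(q_sel @ k_sel.transpose(1, 2) * self.scale, dim=-1)
        out = attn @ v_sel                       # (B, nq, C)

        # Write the temporally fused queries back; unselected positions
        # keep their per-frame features unchanged.
        fused = cur_feat.clone()
        fused[b, q_idx] = out + q_sel
        return fused


# Toy usage on two consecutive 64x128 feature maps with 256 channels.
if __name__ == "__main__":
    cur = torch.randn(2, 64 * 128, 256)
    ref = torch.randn(2, 64 * 128, 256)
    print(SparseTemporalAttention(256)(cur, ref).shape)  # (2, 8192, 256)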
Year
2021
DOI
10.1145/3474085.3475409
Venue
International Multimedia Conference
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
7
Name          Order  Citations  PageRank
Jiangtong Li  1      19         4.31
Wentao Wang   2      8          8.60
Junjie Chen   3      0          1.69
Li Niu        4      0          0.34
Jianlou Si    5      3          1.38
Qian Chen     6      45         11.68
Liqing Zhang  7      2713       181.40