Title
Towards Robust Video Text Detection with Spatio-Temporal Attention Modeling and Text Cues Fusion
Abstract
Information carried by video text is of great value to various video applications. However, detecting text in videos often faces great challenges due to the widely varied appearance of text and the complicated, dynamic video context. In this paper, we propose a robust video text detection network that adaptively combines relevant text cues in multiple frames with spatio-temporal attention and fusion mechanisms, which effectively enhance the accuracy and robustness of video text detection compared to single-frame detection. The network first localizes text region proposals and propagates them across frames with an R-CNN based framework. Then, a Transformer-based cross-frame feature fusion model is employed to attentively select and combine relevant text features, yielding an enhanced representation of the text region that integrates complementary text cues for robust text candidate prediction. The network achieves competitive text detection performance on standard video text benchmarks, demonstrating the effectiveness of the proposed method.
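The abstract describes attentively weighting and combining features of the same text region propagated across frames. The core idea can be illustrated with a minimal scaled dot-product attention sketch; this is a hypothetical simplification for illustration, not the paper's actual Transformer fusion module, and all function names here are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse_across_frames(query, frame_feats):
    """Attention-weighted fusion of per-frame text-region features.

    query:       feature vector of the text region in the current frame
    frame_feats: feature vectors of the propagated region in other frames
    Returns a fused feature vector: frames whose features are more similar
    to the query receive higher attention weights (hypothetical sketch of
    cross-frame feature fusion).
    """
    d = len(query)
    # Scaled dot-product relevance score per frame.
    scores = [sum(q * k for q, k in zip(query, f)) / math.sqrt(d)
              for f in frame_feats]
    weights = softmax(scores)
    # Weighted sum of frame features yields the enhanced representation.
    return [sum(w * f[i] for w, f in zip(weights, frame_feats))
            for i in range(d)]
```

For example, fusing a current-frame feature with two neighbor-frame features gives a representation dominated by the more similar frame while still incorporating complementary cues from the other.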
Year
2022
DOI
10.1109/ICME52920.2022.9859582
Venue
2022 IEEE International Conference on Multimedia and Expo (ICME)
Keywords
Video text, detection, Transformer, fusion, attention
DocType
Conference
ISSN
1945-7871
ISBN
978-1-6654-8564-7
Citations
0
PageRank
0.34
References
0
Authors
2
Name       Order  Citations  PageRank
Long Chen  1      0          0.34
Feng Su    2      170        18.63