Title
Towards Robust Video Text Detection with Spatio-Temporal Attention Modeling and Text Cues Fusion
Abstract
Information carried by video text is of great value to various video applications. However, detecting text in videos often faces great challenges due to the widely varied appearance of text and the complicated, dynamic video context. In this paper, we propose a robust video text detection network that adaptively combines relevant text cues in multiple frames with spatio-temporal attention and fusion mechanisms, which effectively enhance the accuracy and robustness of video text detection compared to single-frame detection. The network first localizes text region proposals and propagates them across frames with an R-CNN based framework. Then, a Transformer-based cross-frame feature fusion model is employed to attentively select and combine relevant text features, yielding an enhanced representation of the text region that integrates complementary text cues for robust text candidate prediction. The network achieves competitive text detection performance on standard video text benchmarks, demonstrating the effectiveness of the proposed method.
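The abstract describes attentively weighting and combining features of the same text region propagated across frames. The core idea can be illustrated with a minimal scaled dot-product attention sketch; this is a hypothetical simplification for illustration, not the paper's actual Transformer fusion module, and all function names here are assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse_across_frames(query, frame_feats):
    """Attention-weighted fusion of per-frame text-region features.

    query:       feature vector of the text region in the current frame
    frame_feats: feature vectors of the propagated region in other frames
    Returns a fused feature vector: frames whose features are more similar
    to the query receive higher attention weights (hypothetical sketch of
    cross-frame feature fusion).
    """
    d = len(query)
    # Scaled dot-product relevance score per frame.
    scores = [sum(q * k for q, k in zip(query, f)) / math.sqrt(d)
              for f in frame_feats]
    weights = softmax(scores)
    # Weighted sum of frame features yields the enhanced representation.
    return [sum(w * f[i] for w, f in zip(weights, frame_feats))
            for i in range(d)]
```

For example, fusing a current-frame feature with two neighbor-frame features gives a representation dominated by the more similar frame while still incorporating complementary cues from the other.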
Year
2022
DOI
10.1109/ICME52920.2022.9859582
Venue
2022 IEEE International Conference on Multimedia and Expo (ICME)
Keywords
Video text, detection, Transformer, fusion, attention
DocType
Conference
ISSN
1945-7871
ISBN
978-1-6654-8564-7
Citations
0
PageRank
0.34
References
0
Authors
2
Name       Order  Citations  PageRank
Long Chen  1      0          0.34
Feng Su    2      170        18.63