Title
Robustly detect different types of text in videos
Abstract
Text in videos can be categorized into three types: overlaid text, layered text, and scene text. Existing detection methods typically target a single type of text and perform poorly on the others; to our knowledge, few works have explored building a system that detects all types simultaneously. In this paper, we present a unified video text detector that accurately localizes all types of text in videos. Our system consists of a spatial text detector and a temporal fusion filter. First, we investigate three different strategies for training the spatial text detector with deep convolutional neural networks, so that it can detect the various kinds of text without knowing their types in advance. Second, a new area-first non-maximum suppression scheme, combined with multiple constraints, removes redundant bounding boxes. Finally, the temporal fusion filter exploits spatial-location and text-component features to integrate the detection results of consecutive frames and further remove false positives. To validate the proposed approach, comprehensive experiments are carried out on three publicly available datasets covering overlaid text, layered text, and scene text. The experimental results demonstrate that our method consistently achieves the best performance compared with state-of-the-art methods.
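The area-first suppression step mentioned in the abstract can be illustrated with a short sketch. The Python code below is a minimal sketch, assuming axis-aligned boxes in [x1, y1, x2, y2] form and a single IoU threshold; the paper's additional "multiple constraints" are not specified in the abstract and are omitted here, and the function name and parameters are hypothetical. The only change from standard NMS is that candidates are ranked by box area instead of by confidence score.

import numpy as np

def area_first_nms(boxes, iou_thresh=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2] candidate text boxes.
    # Returns the indices of the boxes that survive suppression.
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = areas.argsort()[::-1]  # area-first ranking: largest boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the current box with all remaining candidates.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop candidates that overlap the kept (larger) box too much.
        order = order[1:][iou <= iou_thresh]
    return keep

In the pipeline described by the abstract, such a filter would run per frame on the spatial detector's output, before the temporal fusion filter integrates detections across consecutive frames.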
Year
2020
DOI
10.1007/s00521-020-04729-6
Venue
Neural Computing & Applications
Keywords
Video text detector, Temporal consistency, Spatial location, Component representation
DocType
Journal
Volume
32
Issue
16
ISSN
0941-0643
Citations
0
PageRank
0.34
References
0
Authors
2
Name            Order  Citations  PageRank
Yuanqiang Cai   1      2          2.05
Weiqiang Wang   2      13         8.65