Title
Two Semi-Supervised Training Approaches for Automated Text Recognition
Abstract
Automated text recognition is a fundamental problem in Document Image Analysis. Optical models are used for modeling characters, while language models are used for composing sentences. Since scripts and linguistic contexts differ widely, the models must be specialized by training on task-dependent ground truth. However, to create a sufficient amount of ground truth, at least for historical handwritten scripts, well-qualified persons have to mark and transcribe text lines, which is very time-consuming. On the other hand, in many cases unassigned transcripts are already available at page level from another process chain, or at least transcripts from a similar linguistic context are available. In this work we present two approaches that make use of such transcripts: the first creates training data by automatically assigning page-dependent transcripts to text lines, whereas the second uses a task-specific language model to generate highly confident training data. Both approaches are successfully applied to a very challenging historical handwritten collection.
Year
2020
DOI
10.1109/ICFHR2020.2020.00036
Venue
2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR)
Keywords
semi-supervised training, text-image alignment, handwritten text recognition, HTR, automated text recognition, ATR
DocType
Conference
ISSN
2167-6445
ISBN
978-1-7281-9967-2
Citations
0
PageRank
0.34
References
4
Authors
3
Name                  Order  Citations  PageRank
Gundram Leifert       1      27         5.80
Roger Labahn          2      24         4.90
Joan-Andreu Sánchez   3      198        29.00