Two Semi-Supervised Training Approaches for Automated Text Recognition - Citegraph

Paper Info

Title
Two Semi-Supervised Training Approaches for Automated Text Recognition

Abstract
Automated text recognition is a fundamental problem in Document Image Analysis. Optical models are used for modeling characters while language models are used for composing sentences. Since the scripts and linguistic context differ widely, it is mandatory to specialize the models by training on task-dependent ground-truth. However, to create a sufficient amount of ground-truth, at least for historical handwritten scripts, well-qualified persons have to mark and transcribe text lines, which is very time-consuming. On the other hand, in many cases unassigned transcripts are already available on page-level from another process chain, or at least transcripts from similar linguistic context are available. In this work we present two approaches that make use of such transcripts: whereas the first one creates training data by automatically assigning page-dependent transcripts to text lines, the second one uses a task-specific language model to generate highly confident training data. Both approaches are successfully applied on a very challenging historical handwritten collection.

Year	2020
DOI	10.1109/ICFHR2020.2020.00036
Venue	2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR)
Keywords	semi-supervised training, text-image alignment, handwritten text recognition, HTR, automated text recognition, ATR
DocType	Conference
ISSN	2167-6445
ISBN	978-1-7281-9967-2
Citations	0
PageRank	0.34
References	4
Authors	3

Authors (3 rows)

Name	Order	Citations	PageRank
Gundram Leifert	1	27	5.80
Roger Labahn	2	24	4.90
Joan-Andreu Sánchez	3	198	29.00
