okralact - a multi-engine Open Source OCR training system - Citegraph

Paper Info

Title
okralact - a multi-engine Open Source OCR training system

Abstract
Optical character recognition (OCR) of historical documents has been significantly more difficult than OCR of modern texts largely due to idiosyncrasies and wide variability of font, layout, language, and orthography of printed texts before ca. 1850. However, traditional OCR engines were optimized towards supporting the widest possible set of modern text ("OmniFont OCR") with little or no facilities for the user to adapt the engine. Since OCR technologies began embracing deep neural networks, various Free Software OCR engines are now available that can in principle be adapted to different types of documents by training specific models from ground truth (GT). What these engines offer in terms of implementation finesse, they lack in interoperability and standardization. To overcome this, we developed okralact, a set of specifications and a prototypical implementation of an engine-agnostic system for training Open Source OCR engines like Tesseract, OCRopus, kraken or Calamari. We discuss training of these engines, compare their features, describe the specifications and functionality of okralact and outline how a turn-key system for adapting Open Source OCR engines can contribute to better OCR for historical documents and to the general Open Source OCR ecosystem.

Year	DOI	Venue
2019	10.1145/3352631.3352638	Proceedings of the 5th International Workshop on Historical Document Imaging and Processing
Field	DocType	ISBN
Training system,Computer science,Computer hardware	Conference	978-1-4503-7668-6
Citations	PageRank	References
0	0.34	0
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Konstantin Baierer	1	5	1.29
Rui Dong	2	5	6.89
Clemens Neudecker	3	2	2.05
