Title
okralact - a multi-engine Open Source OCR training system
Abstract
Optical character recognition (OCR) of historical documents has been significantly more difficult than OCR of modern texts, largely due to the idiosyncrasies and wide variability of font, layout, language, and orthography in texts printed before ca. 1850. However, traditional OCR engines were optimized to support the widest possible set of modern texts ("OmniFont OCR"), with few or no facilities for the user to adapt the engine. Since OCR technologies began embracing deep neural networks, various Free Software OCR engines have become available that can in principle be adapted to different types of documents by training specific models from ground truth (GT). What these engines offer in terms of implementation finesse, they lack in interoperability and standardization. To overcome this, we developed okralact, a set of specifications and a prototypical implementation of an engine-agnostic system for training Open Source OCR engines like Tesseract, OCRopus, kraken or Calamari. We discuss training of these engines, compare their features, describe the specifications and functionality of okralact, and outline how a turn-key system for adapting Open Source OCR engines can contribute to better OCR for historical documents and to the general Open Source OCR ecosystem.
Year
2019
DOI
10.1145/3352631.3352638
Venue
Proceedings of the 5th International Workshop on Historical Document Imaging and Processing
Field
Training system, Computer science, Computer hardware
DocType
Conference
ISBN
978-1-4503-7668-6
Citations
0
PageRank
0.34
References
0
Authors
3
Name                 Order  Citations  PageRank
Konstantin Baierer   1      5          1.29
Rui Dong             2      5          6.89
Clemens Neudecker    3      2          2.05