Title
Quality Prediction System for Large-Scale Digitisation Workflows
Abstract
The feasibility of large-scale OCR projects can so far only be assessed by running pilot studies on subsets of the target document collections and measuring the success of different workflows based on precise ground truth, which can be very costly to produce in the required volume. The premise of this paper is that, as an alternative, quality prediction may be used to approximate the success of a given OCR workflow. A new system is thus presented where a classifier is trained using metadata, image and layout features in combination with measured success rates (based on minimal ground truth). Subsequently, only document images are required as input for the numeric prediction of the quality score (no ground truth required). This way, the system can be applied to any number of similar (unseen) documents in order to assess their suitability for being processed using the particular workflow. The usefulness of the system has been validated using a realistic dataset of historical newspaper pages.
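The train-then-predict idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the abstract does not name the model or the features, so the nearest-neighbour averaging, the three feature names, the training values and the function `predict_quality` are all assumptions made for illustration.

```python
import math

def predict_quality(train, features, k=3):
    """Predict a quality score as the mean measured score of the k nearest training pages.

    train    -- list of (feature_vector, measured_success_rate) pairs
    features -- feature vector of an unseen page (no ground truth needed)
    """
    # Rank training pages by Euclidean distance to the unseen page.
    ranked = sorted(train, key=lambda item: math.dist(item[0], features))
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)

# Hypothetical training data: (noise level, skew in degrees, text density)
# paired with an OCR success rate measured against a small ground truth set.
TRAIN = [
    ((0.1, 0.5, 0.8), 0.95),
    ((0.2, 1.0, 0.7), 0.90),
    ((0.6, 3.0, 0.5), 0.60),
    ((0.8, 4.0, 0.4), 0.45),
]

# An unseen page: only image-derived features are supplied.
unseen_page = (0.15, 0.8, 0.75)
print(predict_quality(TRAIN, unseen_page, k=2))
```

The features here are left unscaled for brevity; in practice they would likely need normalising so that no single feature dominates the distance.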
Year
2016
DOI
10.1109/DAS.2016.82
Venue
2016 12th IAPR Workshop on Document Analysis Systems (DAS)
Keywords
Document analysis, Quality prediction, Digitisation, Performance evaluation, Supervised learning, Numeric prediction, Ground truthing, Large-scale
Field
Data mining, Metadata, Quality Score, Computer science, Supervised learning, Premise, Ground truth, Classifier (linguistics), Workflow, Prediction system
DocType
Conference
Citations
1
PageRank
0.37
References
6
Authors
3
Name | Order | Citations | PageRank
Christian Clausner | 1 | 44 | 8.49
Stefan Pletschacher | 2 | 216 | 20.78
Apostolos Antonacopoulos | 3 | 378 | 36.45