Title
Estimating and rating the quality of optically character recognised text
Abstract
The focus of this paper is on the quality of historical text digitised through optical character recognition (OCR) and how it affects text mining. We study the effect OCR errors have on named entity recognition (NER) and show that in a random sample of documents picked from several historical text collections, 30.6% of false negative commodity and location mentions and 13.3% of all manually annotated commodity and location mentions contain OCR errors. We introduce a simple method for estimating text quality of OCRed text and examine how well human raters can evaluate it. We also illustrate how automatic text quality estimation compares to manual rating with the aim of determining a quality threshold below which documents could potentially be discarded or would require extensive correction first. This work was conducted during the Trading Consequences project which focussed on text mining and visualisation of historical documents for the study of nineteenth century trade.
Year
2014
DOI
10.1145/2595188.2595214
Venue
DATeCH
Keywords
digital library, xml, digital humanities, philosophy, application, corpus linguistics, full text search
DocType
Conference
Citations
8
PageRank
1.06
References
6
Authors
2
Name, Order, Citations, PageRank
Beatrice Alex, 1, 237, 25.59
John Burns, 2, 13, 2.59