Title
Extracting person names from diverse and noisy OCR text
Abstract
Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
Year
2010
DOI
10.1145/1871840.1871845
Venue
AND
Keywords
ocred historical document, noisy ocr text, typical extraction algorithm, noisy ocr document, historical information, historical corpus, quality metrics, extracting person name, entity recognition, ocred data, ocr output, extraction quality, information extraction, majority voting
Field
Page layout, Discoverability, Information retrieval, Computer science, Information extraction, Artificial intelligence, Natural language processing, Named-entity recognition
DocType
Conference
Citations
6
PageRank
0.68
References
14
Authors
7
Name               Order  Citations  PageRank
Thomas L. Packer   1      12         2.29
Joshua F. Lutes    2      6          0.68
Aaron P. Stewart   3      6          0.68
David W. Embley    4      1915       480.08
Eric K. Ringger    5      272        39.24
Kevin D. Seppi     6      335        41.46
Lee S. Jensen      7      144        8.86