Extracting person names from diverse and noisy OCR text - Citegraph

Paper Info

Title
Extracting person names from diverse and noisy OCR text

Abstract
Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement.
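
The majority-vote ensemble mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say whether voting is done per token or per name span, so this example assumes per-token voting; the labels "NAME" and "O", the function name `majority_vote`, and the sample extractor outputs are all hypothetical and not taken from the paper.

```python
from collections import Counter


def majority_vote(label_sequences):
    """Combine per-token labels from several extractors by strict majority.

    label_sequences: list of equal-length label lists, one per extractor.
    Tokens without a strict majority label fall back to "O" (not a name).
    """
    if not label_sequences:
        return []
    length = len(label_sequences[0])
    if any(len(seq) != length for seq in label_sequences):
        raise ValueError("all extractors must label the same tokens")
    combined = []
    for labels in zip(*label_sequences):
        label, count = Counter(labels).most_common(1)[0]
        combined.append(label if count > len(labels) / 2 else "O")
    return combined


# Hypothetical outputs of three extractors on a noisy OCR line.
tokens = ["Mr", "Jchn", "Smith", "of", "Provo"]
extractor_a = ["O", "NAME", "NAME", "O", "O"]
extractor_b = ["O", "O", "NAME", "O", "NAME"]
extractor_c = ["O", "NAME", "NAME", "O", "O"]

result = majority_vote([extractor_a, extractor_b, extractor_c])
for token, label in zip(tokens, result):
    print(token, label)
```

With three voters and two labels a strict majority always exists, so the "O" fallback only matters if more extractors or label types are added.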

Year	2010
DOI	10.1145/1871840.1871845
Venue	AND
Keywords	ocred historical document, noisy ocr text, typical extraction algorithm, noisy ocr document, historical information, historical corpus, quality metrics, extracting person name, entity recognition, ocred data, ocr output, extraction quality, information extraction, majority voting
Field	Page layout, Discoverability, Information retrieval, Computer science, Information extraction, Artificial intelligence, Natural language processing, Named-entity recognition
DocType	Conference
Citations	6
PageRank	0.68
References	14
Authors	7

Authors (7 rows)

Name	Order	Citations	PageRank
Thomas L. Packer	1	12	2.29
Joshua F. Lutes	2	6	0.68
Aaron P. Stewart	3	6	0.68
David W. Embley	4	1915	480.08
Eric K. Ringger	5	272	39.24
Kevin D. Seppi	6	335	41.46
Lee S. Jensen	7	144	8.86
