Abstract |
---|
Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficult can it be and what level of quality can be expected? We apply three typical extraction algorithms to the task of extracting person names from multiple types of noisy OCR documents found in the collection of a major genealogy content provider and compare their performance using a number of quality metrics. We also show an improvement in extraction quality using a majority-vote ensemble of the three extractors. We evaluate the extraction quality with respect to two references: what a human can manually extract from OCR output and from the original document images. We illustrate the challenges and opportunities at hand for extracting names from OCRed data and identify directions for further improvement. |
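
The majority-vote ensemble mentioned in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say whether votes are taken per token or per name span, so the sketch assumes per-token voting, and the `PER`/`O` labels, the `majority_vote` function, and the sample extractor outputs are all hypothetical rather than taken from the paper.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]], default: str = "O") -> List[str]:
    """Combine per-token labels from several extractors by strict majority vote.

    Assumption: every extractor labels the same token sequence, one label per token.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all extractors must label the same token sequence")

    combined = []
    for labels in zip(*predictions):
        label, count = Counter(labels).most_common(1)[0]
        # Keep the label only when more than half of the extractors agree;
        # otherwise fall back to the default (non-name) label.
        combined.append(label if count * 2 > len(labels) else default)
    return combined


# Hypothetical noisy OCR tokens and three hypothetical extractor outputs.
tokens = ["Mr", "Jno", "Smith", "of", "Bost0n"]
extractor_a = ["O", "PER", "PER", "O", "O"]
extractor_b = ["O", "O", "PER", "O", "PER"]
extractor_c = ["O", "PER", "PER", "O", "O"]

ensemble = majority_vote([extractor_a, extractor_b, extractor_c])
names = [tok for tok, lab in zip(tokens, ensemble) if lab == "PER"]
```

With three extractors, a token is kept as part of a person name only when at least two of them agree, which is one plausible way an ensemble might suppress errors made by a single extractor on noisy OCR text.
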
Year | DOI | Venue |
---|---|---|
2010 | 10.1145/1871840.1871845 | AND |
Keywords | Field | DocType |
ocred historical document,noisy ocr text,typical extraction algorithm,noisy ocr document,historical information,historical corpus,quality metrics,extracting person name,entity recognition,ocred data,ocr output,extraction quality,information extraction,majority voting | Page layout,Discoverability,Information retrieval,Computer science,Information extraction,Artificial intelligence,Natural language processing,Named-entity recognition | Conference |
Citations | PageRank | References |
6 | 0.68 | 14 |
Authors |
---|
7 |
Name | Order | Citations | PageRank |
---|---|---|---|
Thomas L. Packer | 1 | 12 | 2.29 |
Joshua F. Lutes | 2 | 6 | 0.68 |
Aaron P. Stewart | 3 | 6 | 0.68 |
David W. Embley | 4 | 1915 | 480.08 |
Eric K. Ringger | 5 | 272 | 39.24 |
Kevin D. Seppi | 6 | 335 | 41.46 |
Lee S. Jensen | 7 | 144 | 8.86 |