Automatic Assessment of OCR Quality in Historical Documents - Citegraph

Paper Info

Title
Automatic Assessment of OCR Quality in Historical Documents

Abstract
Mass digitization of historical documents is a challenging problem for optical character recognition (OCR) tools. Issues include noisy backgrounds and faded text due to aging, border/marginal noise, bleed-through, skewing, and warping, as well as irregular fonts and page layouts. As a result, OCR tools often produce a large number of spurious bounding boxes (BBs) in addition to those that correspond to words in the document. This paper presents an iterative classification algorithm to automatically label BBs (i.e., as text or noise) based on their spatial distribution and geometry. The approach uses a rule-based classifier to generate initial text/noise labels for each BB, followed by an iterative classifier that refines the initial labels by incorporating information local to each BB: its spatial location, shape, and size. When evaluated on a dataset containing over 72,000 manually labeled BBs from 159 historical documents, the algorithm classifies BBs with 0.95 precision and 0.96 recall. Further evaluation on a collection of 6,775 documents with ground-truth transcriptions shows that the algorithm can also be used to predict document quality (0.7 correlation) and to improve OCR transcriptions in 85% of the cases.
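
The two-stage idea in the abstract (geometry-based initial labels, then iterative refinement from local context) can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the `Box` class, the thresholds in `rule_based_label`, the neighborhood `radius`, and the majority-vote update in `refine_labels` are all hypothetical stand-ins chosen only to show the general shape of such a pipeline.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Box:
    """A hypothetical OCR bounding box in pixel coordinates."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height
    is_text: bool = False

    @property
    def center(self):
        return (self.x + self.w / 2, self.y + self.h / 2)


def rule_based_label(box, min_h=8.0, max_h=60.0, max_aspect=15.0):
    """Initial text/noise guess from geometry alone (thresholds are assumed)."""
    aspect = box.w / box.h if box.h > 0 else float("inf")
    return min_h <= box.h <= max_h and aspect <= max_aspect


def refine_labels(boxes, radius=100.0, max_iters=10):
    """Relabel each box by majority vote of itself and nearby boxes.

    Repeats until no label changes or max_iters is reached. Ties count as noise.
    """
    for b in boxes:
        b.is_text = rule_based_label(b)
    for _ in range(max_iters):
        new_labels = []
        for b in boxes:
            cx, cy = b.center
            votes = [
                o.is_text
                for o in boxes
                if hypot(o.center[0] - cx, o.center[1] - cy) <= radius
            ]
            new_labels.append(sum(votes) * 2 > len(votes))
        if new_labels == [b.is_text for b in boxes]:
            break  # labels are stable
        for b, label in zip(boxes, new_labels):
            b.is_text = label
    return boxes


if __name__ == "__main__":
    page = [
        Box(10, 10, 40, 12),    # word-sized box
        Box(60, 10, 50, 12),    # word-sized box
        Box(120, 10, 45, 12),   # word-sized box
        Box(90, 8, 30, 70),     # too tall for the rule, but surrounded by text
        Box(900, 900, 3, 3),    # isolated speck
    ]
    for box in refine_labels(page):
        print(box, "text" if box.is_text else "noise")
```

In this toy page, the tall box fails the geometric rule but is relabeled as text because most of its neighbors are text, while the isolated speck has no neighbors and keeps its noise label. A real system would use richer features and a learned classifier in place of the simple vote.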

Year	Venue	Field
2015	PROCEEDINGS OF THE TWENTY-NINTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE	Transcription (linguistics),Digitization,Image warping,Computer science,Optical character recognition,Artificial intelligence,Document quality,Classifier (linguistics),Spurious relationship,Machine learning,Bounding overwatch
DocType	Citations	PageRank
Conference	4	0.47
References	Authors
4	8

Authors (8 rows)


Name	Order	Citations	PageRank
Anshul Gupta	1	5	0.81
Ricardo Gutierrez-Osuna	2	365	44.59
Matthew Christy	3	5	1.15
Boris Capitanu	4	48	6.49
Loretta Auvil	5	147	13.64
Liz Grumbach	6	4	0.47
Richard Furuta	7	1017	171.79
Laura Mandell	8	4	0.47
