Title
Fast Optical Character Recognition through Glyph Hashing for Document Conversion
Abstract
This paper proposes a glyph hashing approach to optical character recognition with applications in document conversion. The viability and efficiency of the approach are tested through its implementation in a print driver on 68,987 PDF documents containing 1.15 billion characters. Results indicate that (a) a hash table with 3.2 million hashes is sufficient to represent all characters from these documents, and (b) 480 fonts are sufficient to cover over 90% of these documents. Glyph recognition experiments indicate that 80% of unique character glyphs and over 96% of all characters from unseen documents can be found in a hash table built using all 68,987 documents. The hashing approach is used to recognize not only the character codes but also the size, style (bold, italic, etc.), and font name. We found that the hashing approach can scale to hundreds of fonts and thousands of characters per font. Further, it is extremely fast and can recognize over 100,000 characters per second. Owing to its speed, such a hashing approach can complement any existing OCR system by acting as a pre-filter to produce a 4-5 times speedup during document conversion.
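The pre-filter idea in the abstract can be illustrated with a minimal sketch. It assumes glyphs arrive as exact binary bitmaps (as they might from a print driver) and that a table lookup either returns the stored character code, font name, size, and style, or misses and hands the glyph to a slower OCR engine. The names `GlyphInfo`, `glyph_hash`, `GlyphTable`, `recognize`, and `fallback_ocr` are hypothetical, and SHA-1 over the raw pixels is a stand-in, not the hash function used in the paper.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

# A glyph bitmap is modeled as rows of 0/1 pixel values (an assumption).
Bitmap = Sequence[Sequence[int]]


@dataclass(frozen=True)
class GlyphInfo:
    """Attributes the abstract says the hash lookup recovers."""
    char: str
    font: str
    size_pt: float
    style: str


def glyph_hash(bitmap: Bitmap) -> str:
    """Hash the exact pixel content, prefixed with the bitmap dimensions.

    SHA-1 is an illustrative choice, not the paper's hash function.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    digest = hashlib.sha1()
    digest.update(f"{width}x{height}:".encode("ascii"))
    for row in bitmap:
        digest.update(bytes(row))
    return digest.hexdigest()


class GlyphTable:
    """Hash table mapping glyph hashes to their recognized attributes."""

    def __init__(self) -> None:
        self._table: Dict[str, GlyphInfo] = {}

    def add(self, bitmap: Bitmap, info: GlyphInfo) -> None:
        self._table[glyph_hash(bitmap)] = info

    def lookup(self, bitmap: Bitmap) -> Optional[GlyphInfo]:
        return self._table.get(glyph_hash(bitmap))


def recognize(
    glyphs: Sequence[Bitmap],
    table: GlyphTable,
    fallback_ocr: Callable[[Bitmap], GlyphInfo],
) -> List[GlyphInfo]:
    """Use the hash table as a pre-filter; call the OCR engine only on misses."""
    results = []
    for bitmap in glyphs:
        info = table.lookup(bitmap)
        if info is None:
            info = fallback_ocr(bitmap)  # hypothetical slower recognizer
        results.append(info)
    return results
```

In this sketch, a glyph that was seen while building the table costs one hash computation and one dictionary lookup, so the slower recognizer runs only on the fraction of characters that miss.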
Year: 2005
DOI: 10.1109/ICDAR.2005.110
Venue: ICDAR-1
Keywords: pdf document, fast optical character recognition, hash table, document conversion, unique character glyphs, million hash, font name, billion character, optical character recognition, glyph hashing, character code, unseen document, character sets
Field: Glyph, Pattern recognition, Computer science, Font, Document processing, Optical character recognition, Hash function, Artificial intelligence, Character encoding, Character large object, Hash table
DocType: Conference
ISSN: 1520-5363
ISBN: 0-7695-2420-6
Citations: 2
PageRank: 0.39
References: 0
Authors: 3
Name
Order
Citations
PageRank
Kumar Chellapilla195162.13
Patrice Simard21268621.43
Radoslav Nickolov320.39