Title
Multilingual machine printed OCR
Abstract
This paper presents a script-independent methodology for optical character recognition (OCR) based on the use of hidden Markov models (HMMs). The feature extraction, training and recognition components of the system are all designed to be script independent. The training and recognition components were taken without modification from a continuous speech recognition system; the only component that is specific to OCR is the feature extraction component. To port the system to a new language, all that is needed is text image training data from the new language, along with ground truth that gives the sequence of characters along each line of each text image, without specifying the location of the characters on the image. The parameters of the character HMMs are estimated automatically from the training data, without the need for laborious hand-written rules. The system does not require presegmentation of the data, either at the word level or at the character level. Thus, the system is able to handle languages with connected characters in a straightforward manner. The script independence of the system is demonstrated in three languages with different types of script: Arabic, English, and Chinese. The robustness of the system is further demonstrated by testing the system on fax data. An unsupervised adaptation method is then described to improve performance under degraded conditions.
Year: 2001
DOI: 10.1142/S0218001401000745
Venue: IJPRAI
Keywords: multilingual machine, pattern recognition, hidden markov models, markov model, markov process, optical character recognition, image segmentation, localization, speech recognition, adaptation
Field: Pattern recognition, Markov model, Optical character recognition, Feature extraction, Robustness (computer science), Image segmentation, Speech recognition, Artificial intelligence, Constructed language, Hidden Markov model, Mathematics, Facsimile
DocType: Journal
Volume: 15
Issue: 1
ISSN: 0218-0014
ISBN: 981-02-4564-5
Citations: 19
PageRank: 1.37
References: 26
Authors: 5
Name                 Order   Citations   PageRank
Premkumar Natajan    1       19          1.37
Zhidong Lu           2       48          4.28
Richard Schwartz     3       1232        236.90
Issam Bazzi          4       412         38.82
John Makhoul         5       165         12.83