Title
An Open Vocabulary OCR System with Hybrid Word-Subword Language Models
Abstract
The accuracy of a typical state-of-the-art optical character recognition (OCR) system benefits greatly from using a language model (LM). However, a conventional LM has a limited vocabulary, resulting in out-of-vocabulary (OOV) words that cannot be recognized by the OCR system. In this paper, we present an open vocabulary OCR system based on a hybrid LM. The vocabulary of the hybrid LM consists of both words and subwords. OOV words can be generated by combinations of subwords. A refined hybrid LM training scheme is applied by interpolating a standard hybrid LM, a word-based LM and a subword-based LM. An efficient word combination method is performed by modeling optional space symbols in a decoding network. The overall system deals with OOV words in a general, data-driven and language-independent way. We conduct experiments on an English handwriting OCR task. Evaluations on three testing sets demonstrate that the OCR system with the proposed method achieves a word error rate of 33.4% on an OOV-only testing set, yet without degrading the recognition accuracies on the other two testing sets mainly consisting of in-vocabulary words.
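The two ideas in the abstract, composing OOV words from subwords and interpolating three LMs, can be illustrated with a minimal sketch. This is not the authors' implementation. The greedy longest-match segmentation, the single-character fallback, the linear form of the interpolation, the weights, and the function names `to_hybrid_tokens` and `interpolate` are all assumptions made for illustration; the abstract does not specify any of them.

```python
def to_hybrid_tokens(word, word_vocab, subword_vocab):
    """Keep an in-vocabulary word whole; otherwise split it into subwords.

    Assumption: greedy longest-match segmentation. The paper may derive
    subword units differently.
    """
    if word in word_vocab:
        return [word]
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate subword starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in subword_vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Assumed fallback: emit one character so any word is representable.
            pieces.append(word[i])
            i += 1
    return pieces


def interpolate(p_hybrid, p_word, p_subword, weights=(0.5, 0.3, 0.2)):
    """Linearly interpolate three LM probabilities for the same event.

    Assumption: linear interpolation with illustrative weights summing to 1.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    w_h, w_w, w_s = weights
    return w_h * p_hybrid + w_w * p_word + w_s * p_subword


if __name__ == "__main__":
    words = {"the", "cat"}            # hypothetical word vocabulary
    subwords = {"un", "seen", "s"}    # hypothetical subword inventory
    print(to_hybrid_tokens("the", words, subwords))
    print(to_hybrid_tokens("unseen", words, subwords))
    print(interpolate(0.2, 0.1, 0.4))
```

In this sketch an in-vocabulary word such as "the" stays a single token, while an OOV word such as "unseen" is emitted as a subword sequence that a decoder could later rejoin into one word.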
Year
2017
DOI
10.1109/ICDAR.2017.91
Venue
2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)
Keywords
open vocabulary OCR system, hybrid word-subword language models, language model, out-of-vocabulary, OOV words, refined hybrid LM training scheme, standard hybrid LM, efficient word combination method, English handwriting OCR task, word error rate, in-vocabulary words, optical character recognition system
Field
Hybrid word, Task analysis, Handwriting, Pattern recognition, Computer science, Word error rate, Optical character recognition, Speech recognition, Artificial intelligence, Decoding methods, Vocabulary, Language model
DocType
Conference
Volume
01
ISSN
1520-5363
ISBN
978-1-5386-3587-2
Citations
2
PageRank
0.39
References
0
Authors
7
Name | Order | Citations | PageRank
Meng Cai | 1 | 68 | 8.24
Wenping Hu | 2 | 82 | 6.77
Kai Chen | 3 | 71 | 5.38
Lei Sun | 4 | 18 | 3.40
Sen Liang | 5 | 8 | 1.21
Xiongjian Mo | 6 | 3 | 0.77
Qiang Huo | 7 | 1098 | 99.69