Title
Japanese language model based on bigrams and its application to on-line character recognition
Abstract
This paper deals with a postprocessing method based on the n-gram approach for Japanese character recognition. In Japanese a small number of phonetic characters (Kana) and thousands of Kanji characters, which are ideographs, are used for describing ordinary sentences. In other words, Japanese sentences not only have a large character set, but also include characters with different entropies. It is therefore difficult to apply conventional methodologies based on n-grams to postprocessing in Japanese character recognition. In order to resolve the above two difficulties, we propose a method that uses parts of speech in the following ways. One is to reduce the number of Kanji characters by clustering them according to the parts of speech that each Kanji character is used in. Another is to increase the entropy of a Kana character by classifying it into more detailed subcategories with part-of-speech attributes. We applied a bigram approach based on these two techniques to a Japanese language model. Experiments yielded the following two results: (1) our language model resolved the imbalance between Kana and Kanji characters, and reduced the perplexity of Japanese to less than 100, when Japanese newspaper texts (containing a total of approximately three million characters) were used for the learning of our model, and (2) the postprocessing using the model for on-line character recognition rectified about half of all substitution errors when the correct characters were among the candidates.
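The bigram postprocessing idea described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation: it uses a plain character bigram model with add-one smoothing and leaves out the part-of-speech clustering of Kanji and the subcategorization of Kana. The names `train_bigram`, `log_prob`, `perplexity`, `rescore`, and the `BOS` marker are hypothetical, and the recognizer scores are assumed to be log-domain values supplied with each candidate.

```python
import math
from collections import Counter

BOS = "<s>"  # hypothetical sentence-start marker (assumption, not from the paper)

def train_bigram(sentences):
    """Count character bigrams and the contexts they follow."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        prev = BOS
        for ch in sentence:
            context_counts[prev] += 1
            bigram_counts[(prev, ch)] += 1
            vocab.add(ch)
            prev = ch
    return context_counts, bigram_counts, vocab

def log_prob(model, prev, ch):
    """Add-one smoothed log P(ch | prev); one extra slot is reserved for unseen characters."""
    context_counts, bigram_counts, vocab = model
    size = len(vocab) + 1
    return math.log((bigram_counts[(prev, ch)] + 1) / (context_counts[prev] + size))

def perplexity(model, sentences):
    """Per-character perplexity: exp of the negative mean log probability."""
    total, n = 0.0, 0
    for sentence in sentences:
        prev = BOS
        for ch in sentence:
            total += log_prob(model, prev, ch)
            n += 1
            prev = ch
    return math.exp(-total / n)

def rescore(model, candidates):
    """Viterbi search over recognizer candidates.

    candidates: one list per character position, each holding
    (character, recognizer_log_score) pairs (assumed input format).
    Returns the string with the best combined recognizer + bigram score.
    """
    best = {BOS: (0.0, "")}  # last character -> (best score, best path)
    for position in candidates:
        new_best = {}
        for ch, rec_score in position:
            score, path = max(
                ((prev_score + log_prob(model, prev, ch) + rec_score, prev_path + ch)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0],
            )
            if ch not in new_best or score > new_best[ch][0]:
                new_best[ch] = (score, path)
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

model = train_bigram(["the cat", "the hat", "that cat"])
# The recognizer slightly prefers "n" in the second position; the bigram model overrides it.
lattice = [[("t", 0.0)], [("n", -0.1), ("h", -0.3)], [("e", 0.0)]]
print(rescore(model, lattice))
```

In this toy lattice the correct character is among the candidates but not ranked first, which is the situation the abstract's second result refers to. A class-based variant in the spirit of the paper would replace the raw characters with part-of-speech-derived classes before counting.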
Year: 1995
DOI: 10.1016/0031-3203(94)E0053-N
Venue: Pattern Recognition
Keywords: n-gram, Postprocessing, On-line character recognition, Language model, Japanese Morphological analysis, Part-of-speech
Field: Perplexity, Part of speech, Speech recognition, Artificial intelligence, Natural language processing, Bigram, Cluster analysis, Character encoding, Mathematics, Language model, Kana, Kanji
DocType: Journal
Volume: 28
Issue: 2
ISSN: 0031-3203
Citations: 2
PageRank: 0.40
References: 5
Authors (1)
Name: Nobuyasu Itoh
Order: 1
Citations: 65
PageRank: 13.19