Title
Unsupervised segmentation of Chinese text by use of branching entropy
Abstract
We propose an unsupervised segmentation method based on an assumption about language data: that the increasing point of entropy of successive characters is the location of a word boundary. A large-scale experiment was conducted by using 200 MB of unsegmented training data and 1 MB of test data, and precision of 90% was attained with recall being around 80%. Moreover, we found that the precision was stable at around 90% independently of the learning data size.
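The assumption stated in the abstract can be illustrated with a small sketch. This is a simplified illustration, not the authors' implementation: the function names, the `max_n` context limit, and the plain "entropy rose compared with the shorter context" boundary rule are all assumptions made for this example. It counts which character follows each short context in unsegmented text, computes the entropy of that successor distribution, and cuts wherever the entropy increases as the context grows by one character.

```python
import math
from collections import Counter, defaultdict


def train_successors(corpus, max_n=3):
    """Count which character follows every context of length 1..max_n.

    max_n is an illustrative parameter, not a value taken from the paper.
    """
    successors = defaultdict(Counter)
    for i in range(len(corpus)):
        for n in range(1, max_n + 1):
            if i + n < len(corpus):
                successors[corpus[i:i + n]][corpus[i + n]] += 1
    return successors


def branching_entropy(successors, context):
    """Entropy (in bits) of the next character given the context."""
    counts = successors.get(context)
    if not counts:
        return 0.0  # unseen context: treated as carrying no information
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def segment(text, successors, max_n=3):
    """Cut after a context whose entropy is higher than that of the
    context one character shorter (a simplified boundary rule)."""
    boundaries = set()
    for start in range(len(text)):
        previous = None
        for n in range(1, max_n + 1):
            end = start + n
            if end > len(text):
                break
            h = branching_entropy(successors, text[start:end])
            if previous is not None and h > previous:
                boundaries.add(end)
            previous = h
    words, last = [], 0
    for b in sorted(boundaries):
        if 0 < b < len(text):
            words.append(text[last:b])
            last = b
    words.append(text[last:])
    return words


if __name__ == "__main__":
    model = train_successors("abxabyabz")
    print(segment("abxaby", model))
```

In the toy corpus, "a" is always followed by "b", while "ab" is followed by three different characters, so the entropy rises after "ab" and a boundary is placed there. A practical system would train on far more data and would likely need a threshold on the size of the increase.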
Year
2006
Venue
ACL
Keywords
large-scale experiment, test data, Chinese text, unsupervised segmentation method, unsegmented training data, language data, successive character, increasing point, data size, unsupervised segmentation, word boundary
DocType
Conference
Volume
P06-2
Citations
35
PageRank
1.60
References
8
Authors
2
Name                 Order  Citations  PageRank
Zhihui Jin           1      54         3.24
Kumiko Tanaka-Ishii  2      261        36.69