Abstract | ||
---|---|---|
We propose an unsupervised segmentation method based on an assumption about language data: that a word boundary is located at the point where the entropy of successive characters increases. A large-scale experiment was conducted using 200 MB of unsegmented training data and 1 MB of test data; a precision of 90% was attained, with recall of around 80%. Moreover, we found that precision remained stable at around 90% regardless of the training data size. |
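
The assumption stated in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the fixed context length `n`, the threshold `delta`, the zero-entropy fallback for unseen contexts, and the function names `train_successors`, `branching_entropy`, and `segment` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_successors(text, n):
    """Count which character follows each length-n context in the training text."""
    successors = defaultdict(Counter)
    for i in range(n, len(text)):
        successors[text[i - n:i]][text[i]] += 1
    return successors


def branching_entropy(successors, context):
    """Shannon entropy (in bits) of the next-character distribution after `context`."""
    counts = successors.get(context)
    if not counts:
        return 0.0  # unseen context: treated as zero entropy (an assumption)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def segment(text, successors, n, delta=0.0):
    """Split `text` wherever the entropy rises by more than `delta`
    compared with the previous position."""
    boundaries = []
    prev = None
    for end in range(n, len(text)):
        h = branching_entropy(successors, text[end - n:end])
        if prev is not None and h - prev > delta:
            boundaries.append(end)  # entropy increased: hypothesize a boundary
        prev = h

    # Cut the text at the collected boundary offsets.
    pieces, start = [], 0
    for b in boundaries:
        pieces.append(text[start:b])
        start = b
    pieces.append(text[start:])
    return pieces
```

In use, `train_successors` would be run once over a large unsegmented corpus and `segment` applied to held-out text. Raising `delta` would be expected to yield fewer proposed boundaries; the precision and recall figures quoted in the abstract come from the authors' own experiment and are not reproduced by this sketch.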
Year | Venue | Keywords |
---|---|---|
2006 | ACL | large-scale experiment, test data, Chinese text, unsupervised segmentation method, unsegmented training data, language data, successive character, increasing point, data size, unsupervised segmentation, word boundary |
DocType | Volume | Citations |
---|---|---|
Conference | P06-2 | 35 |
PageRank | References | Authors |
---|---|---|
1.60 | 8 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Zhihui Jin | 1 | 54 | 3.24 |
Kumiko Tanaka-Ishii | 2 | 261 | 36.69 |