Title
Chinese text categorization using the character N-gram
Abstract
We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and used it to classify English, Japanese, and Korean text documents. Because the accumulation method forms its index terms from character N-grams, it does not depend on the structure of the language. As long as the text documents are encoded in Unicode, the same algorithm can classify them. In the present paper, we classify Chinese text documents, namely newspaper articles from the People's Daily 2009-2010 data set. On this data set, the highest macro-averaged F-measure of the proposed method was 92.6%. Thus, the method gives good results for the Chinese language. Moreover, we can construct a framework in which the computer automatically judges how difficult each document is to classify.
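The abstract's central idea, forming index terms from character N-grams so that no language-specific word segmentation is needed, can be illustrated with a minimal sketch. The function names `char_ngrams` and `index_terms` are hypothetical, and the sketch covers only term extraction; the scoring step of the accumulation method is not described in this record and is not reproduced here.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character N-grams of `text`.

    Python strings are indexed by Unicode code point, so the same
    slicing works for English, Japanese, Korean, or Chinese text.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # A text shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def index_terms(text: str, n: int) -> Counter:
    """Count each character N-gram (assumed here to serve as an index term)."""
    return Counter(char_ngrams(text, n))


# The same call handles alphabetic and Chinese text alike.
print(char_ngrams("text", 2))
print(char_ngrams("人民日报", 2))
```

Note that this sketch keeps whitespace and punctuation inside the N-grams; whether the original method filters them is not stated here.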
Year
2012
Venue
Information Theory and its Applications
Keywords
classification, indexing, natural language processing, text analysis, Chinese language, Chinese text categorization, Chinese text document classification, English text document, Japanese text document, Korean text document, Unicode, accumulation method, character N-gram, index term, language structure, language-independent text classification method, macroaveraged F-measure, newspaper article
Field
Document classification, Text graph, Noisy text analytics, Computer science, Full text search, Text segmentation, Language identification, Natural language processing, Artificial intelligence, n-gram, Unicode
DocType
Conference
ISBN
978-1-4673-2521-9
Citations
0
PageRank
0.34
References
4
Authors
3
Name               Order  Citations  PageRank
Makoto Suzuki      1      4          1.27
Naohide Yamagishi  2      0          0.34
Yi-Ching Tsai      3      0          0.34