Chinese text categorization using the character N-gram - Citegraph

Paper Info

Title
Chinese text categorization using the character N-gram

Abstract
We previously proposed the accumulation method, which is a language-independent text classification method that is based on the character N-gram, and classified English, Japanese, and Korean text documents. The accumulation method does not depend on the language structure, because this method uses the character N-gram to form index terms. If text documents are expressed in Unicode, then the accumulation method can classify documents using the same algorithm. In the present paper, we classify Chinese text documents, which are newspaper articles from the People's Daily 2009-2010 data set. The highest macro-averaged F-measure of the proposed method was 92.6% for the People's Daily 2009-2010 data set. Thus, we obtain good results for the Chinese language. Moreover, we can construct a framework whereby the computer can automatically distinguish the difficulty of each document classification.
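
The abstract's two technical ingredients, character N-gram index terms over Unicode text and the macro-averaged F-measure, can be illustrated with a minimal sketch. This is not the authors' accumulation method, which the abstract does not specify. The function names `char_ngrams` and `macro_f_measure` and the toy labels are assumptions made for illustration only.

```python
def char_ngrams(text, n):
    """Return the overlapping character N-grams of a Unicode string.

    Python 3 strings are sequences of Unicode code points, so the same
    slicing works for Chinese, Japanese, Korean, or English text without
    any word segmentation.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def macro_f_measure(gold, predicted):
    """Macro-averaged F-measure: the unweighted mean of per-class F1 scores.

    Classes are taken from the gold labels (an assumption of this sketch).
    """
    pairs = list(zip(gold, predicted))
    scores = []
    for c in sorted(set(gold)):
        tp = sum(1 for g, p in pairs if g == c and p == c)
        fp = sum(1 for g, p in pairs if g != c and p == c)
        fn = sum(1 for g, p in pairs if g == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Character bigrams of a four-character Chinese string.
    print(char_ngrams("人民日报", 2))
    # Toy labels, not data from the paper.
    print(macro_f_measure(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Because the N-grams are taken over characters rather than words, the same extraction step applies unchanged to any language expressed in Unicode, which is the property the abstract relies on.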

Year	Venue	Keywords
2012	Information Theory and its Applications	classification, indexing, natural language processing, text analysis, Chinese language, Chinese text categorization, Chinese text document classification, English text document, Japanese text document, Korean text document, Unicode, accumulation method, character N-gram, index term, language structure, language-independent text classification method, macroaveraged F-measure, newspaper article
Field	DocType	ISBN
Document classification, Text graph, Noisy text analytics, Computer science, Full text search, Text segmentation, Language identification, Natural language processing, Artificial intelligence, n-gram, Unicode	Conference	978-1-4673-2521-9
Citations	PageRank	References
0	0.34	4
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Makoto Suzuki	1	4	1.27
Naohide Yamagishi	2	0	0.34
Yi-Ching Tsai	3	0	0.34
