Title
A logistic regression-based smoothing method for Chinese text categorization
Abstract
Automatic Chinese text classification is an important and well-known technology in the field of machine learning. The first step in solving Chinese text categorization problems is to tokenize the Chinese words in a sequence of unsegmented sentences. Previous work often employs a Chinese word tokenizer trained on different sources and then applies conventional text classification approaches. However, these tokenizers are not perfect and often provide incorrect word boundary information. In this paper, we propose an N-gram-based language model that takes word relations into account for Chinese text categorization without a Chinese word tokenizer. To handle the out-of-vocabulary problem, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms traditional methods by at least 11% in micro-averaged F-measure.
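The idea of categorizing unsegmented Chinese text with an N-gram language model can be illustrated with a minimal sketch. It assumes character bigrams, one language model per category, and a uniform prior over categories. The class name `NgramCategorizer`, its parameters, and the toy data are hypothetical. The smoothing shown is plain add-k, used only as a stand-in: the abstract does not describe the logistic-regression smoothing in enough detail to reproduce it here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of text (no word segmentation)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramCategorizer:
    """One character n-gram language model per category (illustrative only).

    Smoothing is plain add-k, NOT the logistic-regression smoothing
    proposed in the paper.
    """

    def __init__(self, n=2, k=1.0):
        self.n = n
        self.k = k
        self.ngram_counts = defaultdict(Counter)    # category -> n-gram counts
        self.context_counts = defaultdict(Counter)  # category -> (n-1)-gram counts
        self.vocab = set()                          # characters seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.vocab.update(text)
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
        return self

    def log_likelihood(self, text, label):
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for gram in char_ngrams(text, self.n):
            num = self.ngram_counts[label][gram] + self.k
            den = self.context_counts[label][gram[:-1]] + self.k * v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform category prior assumed: pick the highest log-likelihood.
        return max(self.ngram_counts, key=lambda c: self.log_likelihood(text, c))


# Toy usage with made-up training sentences.
train_docs = ["股票市场上涨", "股票基金投资", "足球比赛进球", "篮球比赛得分"]
train_labels = ["finance", "finance", "sports", "sports"]
model = NgramCategorizer(n=2, k=1.0).fit(train_docs, train_labels)
print(model.predict("股票投资"))
```

Because the model scores raw character sequences, an incorrect word boundary cannot affect the result. The cost is that unseen bigrams are common, which is why the choice of smoothing matters.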
Year
2011
DOI
10.1016/j.eswa.2011.03.036
Venue
Expert Syst. Appl.
Keywords
text classification, chinese text categorization, logistic regression, different source, n-gram-based classification, chinese word tokenizer, incorrect word boundary information, automatic chinese text classification, word segmentation, conventional text classification approach, feature selection, chinese word, chinese text categorization problem, word relation, n-gram-based language model, smoothing method, language model, machine learning
Field
Feature selection, Computer science, Text segmentation, Smoothing, Natural language processing, Artificial intelligence, Lexical analysis, Text categorization, Logistic regression, Machine learning, Language model
DocType
Journal
Volume
Issue
ISSN
38
9
Expert Systems With Applications
Citations
9
PageRank
0.48
References
15
Authors
4
Name            Order  Citations  PageRank
Show-Jane Yen   1      537        130.05
Yue-Shi Lee     2      543        41.14
Jia-Ching Ying  3      34         3.18
Yu-Chieh Wu     4      247        23.16