Title | ||
---|---|---|
A global evaluation criterion for feature selection in text categorization using Kullback-Leibler divergence. |
Abstract | ||
---|---|---|
A major difficulty of text categorization is the extremely high dimensionality of the text feature space. Feature selection techniques are therefore desirable in large-scale text categorization tasks for improving accuracy and efficiency. The χ² statistic and the simplified χ² statistic are two effective feature selection methods in text categorization. With either criterion, one computes the local score of a term for each category and usually takes the maximum or the average of these scores as the global term-goodness criterion. However, there is no explicit guidance on whether to choose the maximum or the average; moreover, neither operation reflects the degree of scatter of a term over all categories. In this paper, we propose a new global feature evaluation criterion based on Kullback-Leibler (KL) divergence for choosing informative terms, since KL divergence is a widely used measure of the difference between the distributions of two categories. We conduct experiments on the Reuters-21578 corpus with a k-NN classifier to test the performance of the proposed method. The experimental results show that this method enhances the performance of text categorization. The proposed method performs similarly to or better than the conventional maximum and average criteria on either Macro-F1 or Micro-F1. © 2011 IEEE. |
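
The conventional scoring that the abstract describes can be illustrated with a minimal sketch. It assumes the standard 2×2 contingency-table form of the χ² statistic and a plain (unweighted) mean for the average; the function names `chi_square` and `global_scores` are hypothetical, and the paper's KL-based criterion is not reproduced here because the abstract does not give its formula.

```python
def chi_square(a, b, c, d):
    """Local chi-square score of one term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): treat as uninformative.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def global_scores(local_scores):
    """Return the (maximum, unweighted average) of per-category scores."""
    return max(local_scores), sum(local_scores) / len(local_scores)


# Two made-up score vectors with the same maximum and the same average,
# but spread differently over the three categories.
term_x = [4.0, 0.0, 2.0]
term_y = [4.0, 1.0, 1.0]
print(global_scores(term_x))
print(global_scores(term_y))
```

Both made-up vectors yield the same maximum and the same average even though the scores are distributed differently across categories, which is the limitation of the two aggregations that the abstract points out.
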
Year | DOI | Venue |
---|---|---|
2011 | 10.1109/SoCPaR.2011.6089284 | SoCPaR |
Keywords | Field | DocType |
chi-square statistic,feature selection,global evaluation criterion,kullback-leibler divergence,text categorization,text analysis,feature extraction,machine learning,support vector machine,kullback leibler divergence,kullback leibler,feature space | Feature vector,Text mining,Feature selection,Pattern recognition,Statistic,Computer science,Curse of dimensionality,Feature extraction,Artificial intelligence,Classifier (linguistics),Kullback–Leibler divergence,Machine learning | Conference |
Volume | Issue | Citations |
null | null | 6 |
PageRank | References | Authors |
0.48 | 22 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Zhilong Zhen | 1 | 6 | 1.15 |
Xiaoqin Zeng | 2 | 407 | 32.97 |
Haijuan Wang | 3 | 6 | 0.81 |
Lixin Han | 4 | 135 | 14.47 |