Title
Information gain and divergence-based feature selection for machine learning-based text categorization
Abstract
Most previous work on feature selection has emphasized only the reduction of the high dimensionality of the feature space. When many features are highly redundant with one another, however, other means are needed, such as more complex dependence models like Bayesian network classifiers. In this paper, we introduce a new information gain and divergence-based feature selection method for statistical machine learning-based text categorization that does not rely on such complex dependence models. Our feature selection method strives to reduce redundancy between features while maintaining information gain when selecting appropriate features for text categorization. Empirical results on a number of datasets show that our feature selection method is more effective than Koller and Sahami's method [Koller, D., & Sahami, M. (1996). Toward optimal feature selection. In Proceedings of ICML-96, 13th international conference on machine learning], a greedy feature selection method, and than conventional information gain, which is commonly used in feature selection for text categorization. Moreover, our feature selection method sometimes enables conventional machine learning algorithms to outperform support vector machines, which are known to give the best classification accuracy.
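The abstract describes choosing features that score high on information gain with respect to the category labels while penalizing features that are redundant with already-selected ones, where redundancy is judged by a divergence between distributions. The Python sketch below illustrates that general idea as a greedy procedure; it is a minimal illustration, not the authors' exact formulation: the binary term-occurrence representation, the Jensen-Shannon divergence as the redundancy measure, and the alpha trade-off weight are all assumptions made for this example.

```python
import numpy as np

# Illustrative sketch of redundancy-aware, information-gain-based feature selection.
# The scoring rule and the divergence choice are assumptions for this example,
# not the exact method of Lee & Lee (2006).

def information_gain(X, y):
    """Information gain of each binary term-occurrence feature with respect to
    the class labels. X: (n_docs, n_terms) 0/1 matrix, y: (n_docs,) labels."""
    n_docs, n_terms = X.shape
    classes = np.unique(y)
    p_c = np.array([(y == c).mean() for c in classes])
    h_c = -np.sum(p_c * np.log2(p_c))             # entropy H(C)
    ig = np.zeros(n_terms)
    for t in range(n_terms):
        cond_entropy = 0.0
        for mask in (X[:, t] > 0, X[:, t] == 0):  # term present / absent
            p_t = mask.mean()
            if p_t == 0:
                continue
            p_c_t = np.array([(y[mask] == c).mean() for c in classes])
            cond_entropy += p_t * -np.sum(p_c_t * np.log2(p_c_t + 1e-12))
        ig[t] = h_c - cond_entropy                # IG = H(C) - H(C | term)
    return ig

def class_dist_given_term(X, y, t, classes):
    """P(c | term t present); used to compare how two terms point at the classes."""
    present = X[:, t] > 0
    if not present.any():
        return np.full(len(classes), 1.0 / len(classes))
    return np.array([(y[present] == c).mean() for c in classes]) + 1e-12

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_features(X, y, k, alpha=1.0):
    """Greedily pick k terms with high information gain that are not redundant
    (in the divergence sense) with the terms already chosen."""
    classes = np.unique(y)
    ig = information_gain(X, y)
    selected = [int(np.argmax(ig))]               # start from the single best-IG term
    candidates = set(range(X.shape[1])) - set(selected)
    while len(selected) < k and candidates:
        best_t, best_score = None, -np.inf
        for t in candidates:
            # divergence to the closest already-selected term:
            # a small value means t is nearly redundant with something already kept
            min_div = min(js_divergence(class_dist_given_term(X, y, t, classes),
                                        class_dist_given_term(X, y, s, classes))
                          for s in selected)
            score = ig[t] + alpha * min_div       # informative and non-redundant
            if score > best_score:
                best_t, best_score = t, score
        selected.append(best_t)
        candidates.remove(best_t)
    return selected
```

Setting alpha to 0 reduces the procedure to plain information-gain ranking, the conventional baseline the abstract compares against; a positive alpha trades some information gain for less redundancy among the selected terms.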
Year
2006
DOI
10.1016/j.ipm.2004.08.006
Venue
Inf. Process. Manage.
Keywords
divergence-based feature selection method, feature selection, information gain and divergence-based feature selection, optimal feature selection, complex dependence model, feature selection method, greedy feature selection method, machine learning-based text categorization, text categorization, conventional information gain, feature space, appropriate feature, information gain, support vector machine, machine learning
Field
Data mining, Dimensionality reduction, Feature selection, Computer science, Feature (machine learning), Artificial intelligence, Feature vector, Pattern recognition, Feature (computer vision), Feature extraction, Minimum redundancy feature selection, Feature learning, Machine learning
DocType
Journal
Volume
42
Issue
1
Journal
Information Processing and Management
Citations
115
PageRank
3.28
References
8
Authors
2
Name               Order  Citations  PageRank
Changki Lee        1      279        26.18
Gary Geunbae Lee   2      932        93.23