Abstract |
---|
This paper is a comparative study of feature selection
methods in statistical learning of text categorization.
The focus is on aggressive dimensionality reduction. Five
methods were evaluated, including term selection based on
document frequency (DF), information gain (IG), mutual
information (MI), a χ²-test (CHI), and term strength (TS).
We found IG and CHI most effective in our experiments.
Using IG thresholding with a k-nearest neighbor classifier
on the Reuters corpus, removal of up to 98% of the
unique terms actually yielded improved classification
accuracy (measured by average precision). DF thresholding
performed similarly. Indeed we found strong correlations
between the DF, IG and CHI values of a term. This suggests
that DF thresholding, the simplest method with the lowest
cost in computation, can be reliably used instead of IG or
CHI when the computation of these measures is too
expensive. TS compares favorably with the other methods
at up to 50% vocabulary reduction but is not
competitive at higher vocabulary reduction levels. In
contrast, MI performed relatively poorly because of its
bias toward rare terms and its sensitivity to
probability estimation errors. |
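As an illustration of two of the measures compared in the abstract, the sketch below computes document frequency (DF) and the χ² statistic (CHI) from a term-category contingency table. The toy corpus and function names are illustrative assumptions, not data or code from the paper's Reuters experiments.

```python
# Toy corpus of (term set, category) pairs -- illustrative only.
docs = [
    ({"oil", "price", "market"}, "trade"),
    ({"oil", "barrel"}, "trade"),
    ({"wheat", "crop", "price"}, "grain"),
    ({"wheat", "harvest"}, "grain"),
]

def doc_freq(term):
    """DF: number of documents in which the term occurs."""
    return sum(1 for terms, _ in docs if term in terms)

def chi2(term, category):
    """CHI: chi-square statistic over the 2x2 term/category
    contingency table (A: term & cat, B: term & not cat,
    C: no term & cat, D: neither)."""
    N = len(docs)
    A = sum(1 for t, c in docs if term in t and c == category)
    B = sum(1 for t, c in docs if term in t and c != category)
    C = sum(1 for t, c in docs if term not in t and c == category)
    D = N - A - B - C
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom

print(doc_freq("oil"))        # 2
print(chi2("oil", "trade"))   # 4.0 (perfect association in this toy corpus)
```

Ranking terms by either score and discarding those below a threshold is the thresholding procedure the abstract refers to; with DF this requires only a single counting pass over the corpus, which is why it is the cheapest of the five methods.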
Year | Venue | Keywords
---|---|---|
1997 | ICML | feature selection, comparative study, text categorization, information gain, mutual information, k nearest neighbor

DocType | ISBN | Citations
---|---|---|
Conference | 1-55860-486-3 | 2364

PageRank | References | Authors
---|---|---|
233.86 | 4 | 2

Name | Order | Citations | PageRank |
---|---|---|---|
Yiming Yang | 1 | 3299 | 344.91 |
Jan O. Pedersen | 2 | 6301 | 1177.07 |