Title
A two-stage Markov blanket based feature selection algorithm for text classification.
Abstract
Designing a good feature selection (FS) algorithm is of utmost importance, especially for text classification (TC), where the large number of features representing terms or words poses serious challenges to the effectiveness and efficiency of classifiers. FS algorithms fall into two broad categories, namely feature ranking (FR) and feature subset selection (FSS) algorithms. Unlike FSS algorithms, FR algorithms select those features that are individually highly relevant to the class or category without taking feature interactions into account. This makes FR algorithms simpler and computationally more efficient than FSS algorithms and thus usually the preferred choice for TC. Bi-normal separation (BNS) (Forman, 2003) and information gain (IG) (Yang and Pedersen, 1997) are well-known FR metrics. However, FR algorithms output a set of highly relevant features or terms that may be redundant and can thus degrade a classifier's performance. This paper suggests taking the interactions of words into account in order to eliminate redundant terms. Stand-alone FSS algorithms can be computationally expensive for high-dimensional text data. We therefore propose a two-stage FS algorithm, which employs an FR metric such as BNS or IG in the first stage and an FSS algorithm such as the Markov blanket filter (MBF) (Koller and Sahami, 1996) in the second stage. Most two-stage algorithms proposed in the literature for TC combine feature ranking with feature transformation algorithms such as principal component analysis (PCA). To estimate the statistical significance of our two-stage algorithm, we carry out experiments on 10 different splits of the training and test sets of each of three data sets (Reuters-21578, TREC, OHSUMED) with naive Bayes and support vector machine classifiers. Our results, based on a paired two-sided t-test, show that the macro F1 performance of BNS+MBF is significantly better than that of stand-alone BNS in 69% of the experimental trials. The macro F1 values of IG improve in 72% of the trials when MBF is used in the second stage. We also compare our two-stage algorithm against two recently proposed FS algorithms, namely the distinguishing feature selector (DFS) (Uysal and Gunal, 2012) and a two-stage algorithm combining IG and PCA (Uguz, 2011). BNS+MBF is found to be significantly better than DFS and IG+PCA in 74% and 78% of the trials, respectively. IG+MBF outperforms DFS and IG+PCA in 93% and 80% of the experimental trials, respectively. Similar results are observed for BNS+MBF and IG+MBF when performance is evaluated in terms of the balanced error rate.
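The abstract describes the pipeline only at a high level, so the following is a minimal, hypothetical Python/scikit-learn sketch of the two-stage idea, not the authors' implementation. The names two_stage_select, n_rank, and corr_threshold are illustrative assumptions; sklearn's mutual_info_classif stands in for the paper's IG metric, and the correlation-based pruning in stage two is a deliberately simplified stand-in for the expected cross-entropy based Markov blanket filter of Koller and Sahami (1996).

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def two_stage_select(X, y, n_rank=2000, corr_threshold=0.9):
    """X: sparse (n_docs, n_terms) term-count matrix; y: class labels.

    Returns indices of the terms kept after both stages.
    """
    # Stage 1 (feature ranking): score every term with an IG-style
    # univariate metric and keep the n_rank highest-scoring terms.
    ig = mutual_info_classif(X, y, discrete_features=True)
    ranked = np.argsort(ig)[::-1][:n_rank]

    # Stage 2 (redundancy elimination): scan the surviving terms from
    # strongest to weakest and drop any term whose column is highly
    # correlated with an already-selected one. This is a crude proxy
    # for MBF's Markov-blanket test, used here only for illustration.
    Xd = np.asarray(X[:, ranked].todense(), dtype=float)
    Xd = (Xd - Xd.mean(axis=0)) / (Xd.std(axis=0) + 1e-12)  # z-score columns
    n_docs = Xd.shape[0]
    selected = []
    for j in range(Xd.shape[1]):
        col = Xd[:, j]
        # Pearson correlation of standardized columns is a dot product / n.
        redundant = any(
            abs(col @ Xd[:, k]) / n_docs > corr_threshold for k in selected
        )
        if not redundant:
            selected.append(j)
    return ranked[selected]

# Hypothetical usage with a bag-of-words matrix:
#   X = CountVectorizer(binary=True).fit_transform(docs)
#   kept_terms = two_stage_select(X, labels, n_rank=2000)
# The paper's evaluation then compares classifiers trained on such
# subsets across 10 train/test splits with a paired two-sided t-test.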
Year: 2015
DOI: 10.1016/j.neucom.2015.01.031
Venue: Neurocomputing
Keywords: Text classification, Two-stage feature selection, Markov blanket discovery
Field: Data set, Feature selection, Computer science, Markov blanket, Artificial intelligence, Classifier (linguistics), Pattern recognition, Naive Bayes classifier, Support vector machine, Word error rate, Algorithm, Principal component analysis, Machine learning
DocType: Journal
Volume: 157
ISSN: 0925-2312
Citations: 18
PageRank: 0.68
References: 33
Authors: 3
Name              Order  Citations  PageRank
Kashif Javed      1      110        8.87
Sameen Maruf      2      26         3.13
Haroon A. Babri   3      81         4.63