Feature selection based on a normalized difference measure for text classification. - Citegraph

Paper Info

Title
Feature selection based on a normalized difference measure for text classification.

Abstract
We analyzed the balanced accuracy (ACC2) feature ranking metric and identified its drawbacks. We proposed normalizing balanced accuracy by the minimum of the tpr and fpr values. We compared the proposed feature ranking metric with seven well-known feature ranking metrics on seven datasets. The newly proposed metric outperforms the others in more than 60% of our experimental trials.

The goal of feature selection in text classification is to choose highly distinguishing features that improve the performance of a classifier. The well-known text classification feature selection metric named balanced accuracy measure (ACC2) (Forman, 2003) evaluates a term by taking the difference between its document frequency in the positive class (also known as true positives) and its document frequency in the negative class (also known as false positives). This, however, assigns equal ranks to terms with equal differences, ignoring their relative document frequencies in the classes. In this paper we propose a new feature ranking (FR) metric, called normalized difference measure (NDM), which takes the relative document frequencies into account. The performance of NDM is investigated against seven well-known feature ranking metrics, namely odds ratio (OR), chi-squared (CHI), information gain (IG), distinguishing feature selector (DFS), Gini index (GINI), balanced accuracy measure (ACC2) and Poisson ratio (POIS), on seven datasets, namely WebACE (WAP, K1a, K1b), Reuters (RE0, RE1), a spam email dataset and 20 Newsgroups, using the multinomial naive Bayes (MNB) and support vector machine (SVM) classifiers. Our results show that the NDM metric outperforms the seven metrics in 66% of cases in terms of the macro-F1 measure and in 51% of cases in terms of the micro-F1 measure in our experimental trials on these datasets.
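The difference between the two metrics can be illustrated with a minimal sketch. It assumes ACC2 is computed as |tpr - fpr| and NDM as that difference divided by min(tpr, fpr), as the abstract describes. The function names, the document counts, and the small `eps` floor used when min(tpr, fpr) is zero are assumptions made for this illustration and are not taken from the paper.

```python
def acc2(tp, fp, n_pos, n_neg):
    """Balanced accuracy measure: absolute difference of tpr and fpr."""
    tpr = tp / n_pos  # fraction of positive-class documents containing the term
    fpr = fp / n_neg  # fraction of negative-class documents containing the term
    return abs(tpr - fpr)


def ndm(tp, fp, n_pos, n_neg, eps=1e-6):
    """Normalized difference measure: ACC2 divided by min(tpr, fpr).

    eps is an assumed floor that avoids division by zero when a term
    is absent from one class; the paper's exact handling may differ.
    """
    tpr = tp / n_pos
    fpr = fp / n_neg
    return abs(tpr - fpr) / max(min(tpr, fpr), eps)


# Two hypothetical terms in a corpus with 800 documents per class.
# Both have the same tpr - fpr difference of 0.25.
term_a = (600, 400, 800, 800)  # tpr = 0.75,  fpr = 0.5
term_b = (300, 100, 800, 800)  # tpr = 0.375, fpr = 0.125

print(acc2(*term_a), acc2(*term_b))  # 0.25 0.25 (ACC2 cannot separate them)
print(ndm(*term_a), ndm(*term_b))    # 0.5 2.0  (NDM ranks term_b higher)
```

In this sketch ACC2 gives both terms the same score, while NDM favors the term whose smaller rate is low, that is, the term concentrated more strongly in one class.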

Year	DOI	Venue
2017	10.1016/j.ipm.2016.12.004	Inf. Process. Manage.
Keywords	Field	DocType
Text classification,Feature selection,Accuracy measure,Document frequency	Chi-square test,Data mining,Normalization (statistics),Information retrieval,Feature selection,Computer science,Feature ranking,Support vector machine,Statistics,Classifier (linguistics),True positive rate,False positive paradox	Journal
Volume	Issue	ISSN
53	2	0306-4573
Citations	PageRank	References
9	0.46	23
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Rehman, A.	1	11	1.19
Kashif Javed	2	110	8.87
Haroon Atique Babri	3	226	6.97
