Title
Term frequency combined hybrid feature selection method for spam filtering.
Abstract
Feature selection is an important technique for improving the efficiency and accuracy of spam filtering. Among the numerous methods, document frequency-based feature selection methods ignore term frequency information and thus often produce unsatisfactory results. In this paper, a hybrid method (called HBM) that combines document frequency information and term frequency information is proposed. To maintain the category-distinguishing ability of the selected features, an optimal document frequency-based feature selection method (called ODFFS) is chosen; terms that are truly discriminative are selected by ODFFS. For the remaining features, term frequency information is considered and the terms with the highest HBM values are selected. Further, a novel method called feature subset evaluating parameter optimization (FSEPO) is proposed for parameter optimization. Experiments with support vector machine (SVM) and Naïve Bayesian (NB) classifiers are conducted on four corpora: PU1, LingSpam, SpamAssassin and TREC2007. Six feature selection methods are compared with HBM: information gain, Chi-square, improved Gini index, multi-class odds ratio, normalized term frequency-based discriminative power measure, and comprehensively measure feature selection. The experimental results show that HBM is significantly superior to the other feature selection methods on all four corpora with both the SVM and NB classifiers.
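The two-stage idea in the abstract (first pick terms by a document frequency-based score, then rank the remaining terms using term frequency information) can be illustrated with a minimal sketch. The abstract does not give the ODFFS or HBM formulas, so both scores below are simple stand-ins chosen for illustration: the absolute gap in per-class document rate and the absolute gap in per-class token share. The function name `two_stage_select` and the parameters `n_df` and `n_tf` are hypothetical and not taken from the paper.

```python
from collections import Counter


def two_stage_select(docs, labels, n_df, n_tf):
    """Pick n_df terms by a document-frequency score, then n_tf more by a
    term-frequency score. Both scores are illustrative stand-ins, NOT the
    paper's ODFFS or HBM formulas.

    docs: list of token lists; labels: 1 for spam, 0 for legitimate mail.
    """
    spam = [d for d, y in zip(docs, labels) if y == 1]
    ham = [d for d, y in zip(docs, labels) if y == 0]

    def doc_rate(group):
        # Fraction of documents in the group that contain each term.
        if not group:
            return {}
        counts = Counter(t for d in group for t in set(d))
        return {t: c / len(group) for t, c in counts.items()}

    def term_rate(group):
        # Share of the group's tokens accounted for by each term.
        counts = Counter(t for d in group for t in d)
        total = sum(counts.values())
        return {t: c / total for t, c in counts.items()} if total else {}

    def gap(a, b, vocab):
        # Absolute difference between the two class-wise rates.
        return {t: abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in vocab}

    vocab = sorted({t for d in docs for t in d})
    df_score = gap(doc_rate(spam), doc_rate(ham), vocab)
    tf_score = gap(term_rate(spam), term_rate(ham), vocab)

    # Stage 1: terms with the largest document-frequency gap.
    stage1 = sorted(vocab, key=lambda t: (-df_score[t], t))[:n_df]
    chosen = set(stage1)

    # Stage 2: among the remaining terms, rank by the term-frequency gap.
    rest = [t for t in vocab if t not in chosen]
    stage2 = sorted(rest, key=lambda t: (-tf_score[t], t))[:n_tf]
    return stage1 + stage2


docs = [
    ["free", "free", "money"],
    ["free", "offer"],
    ["meeting", "notes"],
    ["meeting", "offer", "notes", "notes"],
]
labels = [1, 1, 0, 0]
print(two_stage_select(docs, labels, n_df=2, n_tf=1))
# ['free', 'meeting', 'notes']
```

In this toy run, "notes" ties with "free" and "meeting" on the document-frequency stand-in but misses the first-stage cutoff through alphabetical tie-breaking; the second stage recovers it because it accounts for a large share of the legitimate-mail tokens. How the two stages are balanced is exactly the kind of parameter the abstract says FSEPO is meant to tune; the sketch leaves that split to the caller.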
Year
2016
DOI
10.1007/s10044-014-0408-4
Venue
Pattern Analysis and Applications
Keywords
Feature selection, Spam filtering, Document frequency, Term frequency, Parameter optimization
Field
Data mining, Normalization (statistics), Feature selection, Artificial intelligence, Discriminative model, Chi-square test, Naive Bayes classifier, Pattern recognition, Information gain, Support vector machine, Filter (signal processing), Mathematics, Machine learning
DocType
Journal
Volume
19
Issue
2
ISSN
1433-755X
Citations
7
PageRank
0.40
References
32
Authors
4
Name | Order | Citations | PageRank
Yuan-Ning Liu | 1 | 160 | 22.94
Youwei Wang | 2 | 125 | 16.91
Lizhou Feng | 3 | 52 | 4.81
Xiaodong Zhu | 4 | 73 | 10.24