Title
Highly discriminative statistical features for email classification
Abstract
This paper reports on email classification and filtering based on content features, specifically spam versus ham and phishing versus spam classification. We test the validity of several novel statistical feature extraction methods, which rely on dimensionality reduction to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first is a classic classification scenario using 10-fold cross-validation on several corpora: four standard ground-truth corpora (Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus) and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models on two proprietary datasets, formed by phishing and spam emails sorted by date, and on the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods for email classification on different benchmarking corpora, and evidence that biased discriminant analysis in particular yields features that are more discriminative, give stable classification results regardless of how many features are chosen, and robustly retain their discriminative value over time and across data setups. These findings are especially useful in a commercial setting, where short profile rules for filtering emails are built from a limited number of features.
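The two evaluation schemas named in the abstract can be illustrated with a minimal sketch. It covers only how the data would be partitioned, not the feature extraction or the classifiers. The function names `k_fold_indices` and `chronological_split`, the fixed random seed, and the 70/30 chronological cut are assumptions made for illustration; the abstract does not state how the authors built their folds or where they cut the date-sorted data.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: shuffles the item indices once, deals them into
    k folds whose sizes differ by at most one, and uses each fold as the
    test set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def chronological_split(dated_items, train_fraction=0.7):
    """Split (date, message) pairs so training data precedes test data.

    Hypothetical helper for the "anticipatory" schema: the model is fit
    on the earliest messages and evaluated on the later ones. The 0.7
    default is an assumption, not a value from the paper.
    """
    ordered = sorted(dated_items, key=lambda item: item[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

In the first schema every message is tested exactly once across the ten folds. In the second, no test message is older than any training message, which is what makes it possible to check whether features keep their discriminative value over time.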
Year
2012
DOI
10.1007/s10115-011-0403-7
Venue
Knowl. Inf. Syst.
Keywords
discriminative statistical feature, discriminative feature, different benchmarking corpus, classic classification scenario, spam emails, classification model, spam corpus, data mining, dimensionality reduction, email classification, feature extraction, feature selection, stable classification result, discriminative value, spam classification
Field
Data mining, Dimensionality reduction, Feature selection, Phishing, Computer science, Filter (signal processing), Feature extraction, Ground truth, Artificial intelligence, Discriminative model, Benchmarking, Machine learning
DocType
Journal
Volume
31
Issue
1
ISSN
0219-3116
Citations
15
PageRank
0.63
References
45
Authors
3
Name, Order, Citations, PageRank
Juan Carlos Gomez, 1, 84, 12.89
Erik Boiy, 2, 257, 9.55
Marie-Francine Moens, 3, 1750, 139.27