Highly discriminative statistical features for email classification - Citegraph

Paper Info

Title
Highly discriminative statistical features for email classification

Abstract
This paper reports on email classification and filtering, more specifically on spam versus ham and phishing versus spam classification, based on content features. We test the validity of several novel statistical feature extraction methods. The methods rely on dimensionality reduction in order to retain the most informative and discriminative features. We successfully test our methods under two schemas. The first one is a classic classification scenario using a 10-fold cross-validation technique for several corpora, including four ground truth standard corpora: Ling-Spam, SpamAssassin, PU1, and a subset of the TREC 2007 spam corpus, and one proprietary corpus. In the second schema, we test the anticipatory properties of our extracted features and classification models with two proprietary datasets, formed by phishing and spam emails sorted by date, and with the public TREC 2007 spam corpus. The contributions of our work are an exhaustive comparison of several feature selection and extraction methods in the frame of email classification on different benchmarking corpora, and the evidence that especially the technique of biased discriminant analysis offers better discriminative features for the classification, gives stable classification results notwithstanding the amount of features chosen, and robustly retains their discriminative value over time and data setups. These findings are especially useful in a commercial setting, where short profile rules are built based on a limited number of features for filtering emails.

Year	DOI	Venue
2012	10.1007/s10115-011-0403-7	Knowl. Inf. Syst.
Keywords	Field	DocType
discriminative statistical feature,discriminative feature,different benchmarking corpus,classic classification scenario,spam emails,classification model,spam corpus,data mining,dimensionality reduction,email classification,feature extraction,feature selection,stable classification result,discriminative value,spam classification	Data mining,Dimensionality reduction,Feature selection,Phishing,Computer science,Filter (signal processing),Feature extraction,Ground truth,Artificial intelligence,Discriminative model,Benchmarking,Machine learning	Journal
Volume	Issue	ISSN
31	1	0219-3116
Citations	PageRank	References
15	0.63	45
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Juan Carlos Gomez	1	84	12.89
Erik Boiy	2	257	9.55
Marie-Francine Moens	3	1750	139.27
