Title
PCA document reconstruction for email classification
Abstract
This paper presents a document classifier based on text content features and applies it to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The underlying idea is that principal component analysis (PCA) compresses optimally only the kind of documents (in our experiments, email classes) used to compute the principal components (PCs); for other kinds of documents, compression with only a few components performs poorly. The classifier therefore computes a separate PCA for each document class. When a new instance arrives, it is projected onto the PCs of each class and then reconstructed from those same PCs. The reconstruction error is computed for each class, and the classifier assigns the instance to the class with the smallest error, that is, the smallest divergence from the class representation. We test this approach in email filtering by distinguishing between two message classes (e.g., spam from ham, or phishing from ham). The experiments show that PCADR obtains very good results on the validation datasets employed and outperforms the popular Support Vector Machine classifier.
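The procedure described in the abstract (one PCA per class, projection, reconstruction, smallest error wins) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`fit_class_pca`, `reconstruction_error`, `classify`), the mean-centering step, the Euclidean error measure, the single retained component, and the toy 4-feature "term count" vectors are all assumptions made for the example.

```python
import numpy as np


def fit_class_pca(X, n_components):
    """Fit PCA on one class; rows of X are documents, columns are features.

    Returns (mean, components), where the rows of `components` are the
    top principal directions. Centering is an assumption of this sketch.
    """
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of vt are principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def reconstruction_error(x, model):
    """Project x onto a class's PCs, reconstruct it, and return the error."""
    mean, components = model
    centered = x - mean
    projected = components @ centered          # coordinates in PC space
    reconstructed = components.T @ projected   # back to feature space
    # Euclidean distance is an assumed error measure for illustration.
    return float(np.linalg.norm(centered - reconstructed))


def classify(x, models):
    """Assign x to the class whose PCs reconstruct it with the least error."""
    return min(models, key=lambda label: reconstruction_error(x, models[label]))


# Hypothetical toy data: "ham" varies in features 0-1, "spam" in features 2-3.
ham = np.array([[3, 1, 0, 0], [1, 3, 0, 0], [2, 2, 0, 0], [4, 1, 0, 0]], dtype=float)
spam = np.array([[0, 0, 3, 1], [0, 0, 1, 3], [0, 0, 2, 2], [0, 0, 1, 4]], dtype=float)

models = {
    "ham": fit_class_pca(ham, n_components=1),
    "spam": fit_class_pca(spam, n_components=1),
}

print(classify(np.array([2.0, 1.0, 0.0, 0.0]), models))
print(classify(np.array([0.0, 0.0, 1.0, 2.0]), models))
```

In this toy setup the first message is assigned to "ham" and the second to "spam", because each is reconstructed far better by the components of the class whose features it shares. In practice the number of retained components would be a tuning parameter, and the feature vectors would come from the text content of real messages.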
Year: 2012
DOI: 10.1016/j.csda.2011.09.023
Venue: Computational Statistics & Data Analysis
Keywords: document class, message class, classifier compute, class representation, email classification, new instance, computed pcs, document classifier, vector machine classifier, principal component, pca document reconstruction, new example, pca, principal component analysis, support vector machine, feature extraction
Field: Data mining, Email filtering, Document reconstruction, Pattern recognition, Phishing, Computer science, Email classification, Feature extraction, Artificial intelligence, Margin classifier, Classifier (linguistics), Principal component analysis
DocType: Journal
Volume: 56
Issue: 3
ISSN: 0167-9473
Citations: 25
PageRank: 0.79
References: 32
Authors: 4
Name                   Order  Citations  PageRank
Juan Carlos Gomez      1      84         12.89
Marie-Francine Moens   2      1750       139.27
Gomez, Juan Carlos     3      25         0.79
Moens, Marie-Francine  4      26         1.17