Title
A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise
Abstract
Spam is a never-ending issue that constantly consumes resources to no useful end. In this paper, we envision spam filtering as a pipeline consisting of DNS blacklists, filters based on SYN packet features, filters based on traffic characteristics and filters based on message content. Each stage of the pipeline examines more information in the message but is more computationally expensive. A message is rejected as spam once any layer is sufficiently confident. We analyze this pipeline, focusing on the first three layers, from a single-enterprise perspective. To do this we use a large email dataset collected over two years. We devise a novel ground truth determination system to allow us to label this large dataset accurately. Using two machine learning algorithms, we study (i) how the different pipeline layers interact with each other and the value added by each layer, (ii) the utility of individual features in each layer, (iii) the stability of the layers across time and network events and (iv) an operational use case investigating whether this architecture can be practically useful. We find that (i) the pipeline architecture is generally useful in terms of accuracy as well as in an operational setting, (ii) it generally ages gracefully across long time periods and (iii) in some cases, later layers can compensate for poor performance in the earlier layers. Among the caveats, we find that (i) the utility of network features is not as high from the single-enterprise viewpoint as reported in prior work, (ii) major network events can sharply affect the detection rate, and (iii) the operational (computational) benefit of the pipeline may depend on the efficiency of the final content filter.
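The early-exit behavior the abstract describes (layers ordered from cheapest to most expensive, with a message rejected as soon as one layer is sufficiently confident) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `Layer` class, the `classify` and `field` helpers, the layer names, the score fields and the threshold values are all hypothetical. Real layers would compute a score from DNS blacklist lookups, SYN packet features, traffic characteristics or message content; this toy version reads precomputed scores from a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical message representation: a mapping from layer name to a
# precomputed spam score in [0, 1]. A real system would derive these scores
# from the network and content features each layer inspects.
Message = Dict[str, float]


@dataclass
class Layer:
    name: str                           # label for the layer (illustrative)
    score: Callable[[Message], float]   # returns a spam score in [0, 1]
    threshold: float                    # reject when score >= threshold


def classify(message: Message, layers: List[Layer]) -> Tuple[str, Optional[str]]:
    """Run layers in order and stop at the first sufficiently confident one.

    Returns ("spam", layer_name) on rejection, or ("ham", None) if no layer
    reaches its threshold.
    """
    for layer in layers:
        if layer.score(message) >= layer.threshold:
            return "spam", layer.name
    return "ham", None


def field(name: str) -> Callable[[Message], float]:
    """Toy scorer that reads a precomputed score, defaulting to 0.0."""
    return lambda message: message.get(name, 0.0)


# Layers ordered from cheapest to most expensive; thresholds are made up.
PIPELINE = [
    Layer("dnsbl", field("dnsbl"), 1.0),
    Layer("syn", field("syn"), 0.9),
    Layer("traffic", field("traffic"), 0.9),
    Layer("content", field("content"), 0.5),
]

if __name__ == "__main__":
    print(classify({"dnsbl": 1.0}, PIPELINE))
    print(classify({"syn": 0.2, "traffic": 0.95}, PIPELINE))
    print(classify({"content": 0.1}, PIPELINE))
```

Because the loop returns at the first confident layer, later and costlier layers run only on messages that earlier layers let through, which is where the operational saving discussed in the abstract would come from.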
Year
2014
DOI
10.1016/j.comnet.2013.08.031
Venue
Computer Networks
Keywords
network feature, operational use case, earlier layer, operational setting, major network event, later layer, large-scale empirical analysis, email spam detection, network characteristic, different pipeline layers interact, message content, pipeline architecture, stand-alone enterprise, network event
DocType
Journal
Volume
59
ISSN
1389-1286
Citations
8
PageRank
0.50
References
18
Authors
4
Name                Order  Citations  PageRank
Tu Ouyang           1      15         1.35
Soumya Ray          2      94         8.89
Mark Allman         3      3045       278.07
Michael Rabinovich  4      1212       139.46