Title
Adding Semantics to Email Clustering
Abstract
This paper presents a novel algorithm that clusters emails according to their contents and the sentence styles of their subject lines. The algorithm uses natural language processing and frequent itemset mining techniques to automatically generate meaningful generalized sentence patterns (GSPs) from email subjects. We then put forward a novel unsupervised approach that treats GSPs as pseudo class labels and conducts email clustering in a supervised manner, although no human labeling is involved. The proposed algorithm is expected not only to improve clustering performance but also to provide meaningful descriptions of the resulting clusters through the GSPs. Experimental results on an open dataset (the Enron email dataset) and a personal email dataset that we collected demonstrate that the proposed algorithm outperforms the K-means algorithm in terms of the widely used F1 measure. Furthermore, cluster naming readability improves by 68.5% on the personal email dataset.
Year: 2006
DOI: 10.1109/ICDM.2006.16
Venue: ICDM
Keywords: email clustering, personal email dataset, open dataset, enron email dataset, clustering performance, meaningful generalized sentence pattern, k-means algorithm, meaningful description, novel algorithm, proposed algorithm, natural language processing, k means algorithm, learning artificial intelligence
Field: Data mining, Computer science, Pattern clustering, Readability, Artificial intelligence, Cluster analysis, Sentence, Machine learning, Semantics
DocType: Conference
ISSN: 1550-4786
ISBN: 0-7695-2701-9
Citations: 8
PageRank: 0.67
References: 3
Authors: 5
Name          Order  Citations  PageRank
Hua Li        1      579        25.22
Dou Shen      2      1224       59.46
Benyu Zhang   3      2135       90.41
Zheng Chen    4      5019       256.89
Qiang Yang    5      17039      875.69