Title
High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets
Abstract
High dimensionality remains a significant challenge for document clustering. Recent approaches used frequent itemsets and closed frequent itemsets to reduce dimensionality and to improve the efficiency of hierarchical document clustering. In this paper, we introduce the notion of "closed interesting" itemsets (i.e., closed itemsets with high interestingness). We provide heuristics such as "super item" to efficiently mine these itemsets and show that they provide significant dimensionality reduction over closed frequent itemsets. Using "closed interesting" itemsets, we propose a new, sub-linearly scalable, hierarchical document clustering method that outperforms state-of-the-art agglomerative, partitioning and frequent-itemset-based methods in terms of both clustering quality and runtime performance, without requiring dataset-specific parameter tuning. We evaluate twenty interestingness measures and show that, when used to generate "closed interesting" itemsets and to select parent nodes, Mutual Information, Added Value, Yule's Q and Chi-Square offer the best clustering performance.
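To make the four best-performing measures named in the abstract concrete, the following is a minimal sketch that computes them for a pair of terms from a 2x2 document-count contingency table. It uses the standard textbook definitions; the abstract does not say how the paper extends or normalizes these measures for larger itemsets, so the function name `interestingness`, its parameters, and the unnormalized form of Mutual Information are assumptions made for illustration only.

```python
import math


def interestingness(f11, f10, f01, f00):
    """Textbook pairwise measures for terms A and B (hypothetical helper).

    f11: documents containing both A and B
    f10: documents containing A but not B
    f01: documents containing B but not A
    f00: documents containing neither
    Assumes every row and column total is non-zero and that
    f11*f00 + f10*f01 is non-zero.
    """
    n = f11 + f10 + f01 + f00
    p_a = (f11 + f10) / n  # P(A)
    p_b = (f11 + f01) / n  # P(B)

    # Added Value of A -> B: P(B | A) - P(B)
    added_value = f11 / (f11 + f10) - p_b

    # Yule's Q: normalized difference of the diagonal products
    yules_q = (f11 * f00 - f10 * f01) / (f11 * f00 + f10 * f01)

    # Pearson chi-square statistic for a 2x2 table
    chi_square = (
        n * (f11 * f00 - f10 * f01) ** 2
        / ((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    )

    # Mutual information (in bits) between the presence indicators of A and B
    cells = (
        (f11, p_a, p_b),
        (f10, p_a, 1 - p_b),
        (f01, 1 - p_a, p_b),
        (f00, 1 - p_a, 1 - p_b),
    )
    mutual_information = 0.0
    for count, pa, pb in cells:
        if count > 0:  # empty cells contribute nothing
            p = count / n
            mutual_information += p * math.log2(p / (pa * pb))

    return {
        "added_value": added_value,
        "yules_q": yules_q,
        "chi_square": chi_square,
        "mutual_information": mutual_information,
    }


if __name__ == "__main__":
    # Hypothetical counts: 100 documents, A and B co-occur in 30 of them.
    print(interestingness(30, 10, 20, 40))
```

In this sketch all four measures evaluate to zero when the two terms are statistically independent and grow as co-occurrence becomes stronger than chance, which is the property that lets a threshold on them filter out uninformative itemsets.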
Year
2006
DOI
10.1109/ICDM.2006.81
Venue
ICDM
Keywords
closed interesting itemsets, document clustering, frequent itemsets, efficient hierarchical document clustering, hierarchical document, hierarchical document clustering, high quality, clustering performance, closed frequent itemsets, high interestingness, significant dimensionality reduction, high dimensionality, clustering quality, added value, computer science, data mining, dimensionality reduction, mutual information
Field
Hierarchical clustering, Data mining, Dimensionality reduction, Computer science, Document clustering, Curse of dimensionality, Heuristics, Mutual information, Cluster analysis, Scalability
DocType
Conference
ISSN
1550-4786
ISBN
0-7695-2701-9
Citations
14
PageRank
0.77
References
10
Authors
2
Name
Order
Citations
PageRank
Hassan H. Malik1775.10
John R. Kender2627138.04