Title
A study on text clustering algorithms based on frequent term sets
Abstract
This paper introduces a new text-clustering algorithm, Frequent Term Set-based Clustering (FTSC), which uses frequent term sets to cluster texts. First, it extracts useful information from the documents and inserts it into a database. Then, it uses the Apriori algorithm from association rule mining to discover the frequent term sets efficiently. Finally, it clusters the documents according to the frequent words in subsets of the frequent term sets. Because the algorithm efficiently reduces the dimensionality of the text data in very large databases, it can improve both the accuracy and the speed of clustering. However, the clusters produced by FTSC cannot reflect the overlap between text classes. An improved algorithm based on FTSC, the Frequent Term Set-based Hierarchical Clustering algorithm (FTSHC), is therefore given. FTSHC determines the overlap between text classes from the overlap between frequent term sets, and uses the frequent term sets to provide an understandable description of the discovered clusters. The FTSC, FTSHC and K-Means algorithms are evaluated quantitatively by experiments. The experimental results show that the FTSC and FTSHC algorithms are more efficient than the K-Means algorithm in clustering performance.
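The pipeline outlined in the abstract (mine frequent term sets with Apriori, then group documents by the term sets they contain) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not give the exact assignment rule, so the sketch assumes each document goes to the largest frequent term set it contains, with ties broken by support and then alphabetically. The names `frequent_term_sets`, `cluster_by_term_sets`, `min_support` and the toy documents are all hypothetical.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support):
    """Apriori-style level-wise search for term sets that occur
    in at least `min_support` documents. Returns {term_set: support}."""
    doc_sets = [frozenset(d) for d in docs]

    def support(term_set):
        return sum(1 for d in doc_sets if term_set <= d)

    # Level 1: frequent single terms.
    terms = sorted({t for d in doc_sets for t in d})
    current = [frozenset([t]) for t in terms
               if support(frozenset([t])) >= min_support]
    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        # Join step: merge frequent k-sets whose union has k + 1 terms.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent,
        # and the candidate itself must reach the support threshold.
        current = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k))
            and support(c) >= min_support
        ]
        k += 1
    return frequent


def cluster_by_term_sets(docs, frequent):
    """Assumed rule (not from the paper): assign each document to the
    largest frequent term set it contains; ties go to the higher
    support, then to the alphabetically later term set."""
    clusters = {}
    for i, d in enumerate(docs):
        doc_terms = set(d)
        covering = [s for s in frequent if s <= doc_terms]
        if covering:
            label = max(covering,
                        key=lambda s: (len(s), frequent[s], tuple(sorted(s))))
        else:
            label = frozenset()  # no frequent term set covers this document
        clusters.setdefault(label, []).append(i)
    return clusters


# Toy corpus: each document is already reduced to a list of terms.
docs = [
    ["apple", "banana", "fruit"],
    ["apple", "fruit", "juice"],
    ["python", "code", "bug"],
    ["python", "code", "test"],
    ["fruit", "code"],
]
frequent = frequent_term_sets(docs, min_support=2)
clusters = cluster_by_term_sets(docs, frequent)
for label, members in clusters.items():
    print(sorted(label), members)
```

In this toy corpus the last document contains both `fruit` and `code`, yet the hard assignment places it in a single cluster. That is the kind of class overlap the abstract says FTSC cannot reflect and FTSHC is meant to capture through overlapping frequent term sets.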
Year
2005
DOI
10.1007/11527503_42
Venue
ADMA
Keywords
frequent items set, improved algorithm, ftshc algorithm, frequent term set, hierarchical clustering algorithm, clustering algorithm, k-means algorithm, new text-clustering algorithm, frequent term, apriori algorithm, ftsc algorithm, text clustering, k means algorithm, association rule mining, very large database, hierarchical clustering
Field
Fuzzy clustering, Data mining, CURE data clustering algorithm, Computer science, Document clustering, Apriori algorithm, Artificial intelligence, Cluster analysis, Hierarchical clustering, k-means clustering, Canopy clustering algorithm, Pattern recognition, Algorithm, Machine learning
DocType
Conference
Volume
3584
ISSN
0302-9743
ISBN
3-540-27894-X
Citations
7
PageRank
0.57
References
6
Authors
2
Name, Order, Citations, PageRank
Xiangwei Liu, 1, 18, 6.26
Pilian He, 2, 29, 7.46