An effective web document clustering algorithm based on bisection and merge - Citegraph

Paper Info

Title
An effective web document clustering algorithm based on bisection and merge

Abstract
To cluster web documents that all share the same named entities, we first tried existing clustering algorithms such as K-means and spectral clustering. Unexpectedly, these algorithms turned out to be ineffective for clustering such web documents. After intensive investigation, we found that clustering such web pages is more complicated because (1) the number of clusters in the ground truth is larger than the two or three clusters typical of general clustering problems and (2) the clusters in the data set have extremely skewed size distributions. To overcome these problems, in this paper we propose an effective clustering algorithm that boosts the accuracy of the K-means and spectral clustering algorithms. In particular, to deal with skewed distributions of cluster sizes, our algorithm performs both bisection and merge steps based on normalized cuts of the similarity graph G to correctly cluster web documents. Our experimental results show that our algorithm improves performance by approximately 56% compared to spectral bisection and 36% compared to K-means.
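The abstract names the ingredients (bisection, merge, normalized cuts of a similarity graph G) but not the exact procedure, so the following is only a minimal sketch of those building blocks and not the authors' algorithm. It assumes G is given as a dense symmetric weight matrix `W` with positive node degrees. The bisection uses the standard Fiedler-vector split of the normalized Laplacian, and the helper `should_merge` with its `threshold` parameter is a hypothetical merge rule introduced purely for illustration.

```python
import numpy as np


def normalized_cut(W, mask):
    """Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V).

    W is a symmetric similarity matrix; mask is a boolean array marking A.
    """
    cut = W[mask][:, ~mask].sum()   # total weight crossing the partition
    assoc_a = W[mask].sum()         # total degree of nodes in A
    assoc_b = W[~mask].sum()        # total degree of nodes in B
    if assoc_a == 0 or assoc_b == 0:
        return float("inf")         # degenerate split (empty or isolated side)
    return cut / assoc_a + cut / assoc_b


def spectral_bisect(W):
    """Split the graph in two by the sign of the Fiedler vector.

    Uses the symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    Assumes every node has positive degree.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]     # second-smallest eigenvector, rescaled
    return fiedler > 0


def should_merge(W, idx_a, idx_b, threshold=0.5):
    """Hypothetical merge rule (an assumption, not from the paper):
    merge two clusters when the normalized cut separating them, measured
    on the subgraph they induce, is above `threshold`.
    """
    idx = np.concatenate([idx_a, idx_b])
    sub = W[np.ix_(idx, idx)]
    mask = np.arange(len(idx)) < len(idx_a)
    return normalized_cut(sub, mask) > threshold
```

A low normalized cut means the two sides are weakly connected relative to their internal weight, so a bisection with a small value is worth keeping, while a pair of clusters separated only by a high-valued cut is a candidate for merging back together. How the paper actually schedules these two steps for skewed cluster sizes is not stated in the abstract.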

Year	DOI	Venue
2011	10.1007/s10462-011-9203-4	Artif. Intell. Rev.
Keywords	Field	DocType
Clustering,Spectral bisection,Entity resolution,Data mining	Fuzzy clustering,Data mining,CURE data clustering algorithm,Computer science,Artificial intelligence,Cluster analysis,Single-linkage clustering,k-medians clustering,Canopy clustering algorithm,Complete-linkage clustering,Correlation clustering,Pattern recognition,Machine learning	Journal
Volume	Issue	ISSN
36	1	0269-2821
Citations	PageRank	References
6	0.48	12
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Ingyu Lee	1	52	8.90
Byung-Won On	2	329	28.76
