An efficient clustering approach for large document collections - Citegraph

Paper Info

Title
An efficient clustering approach for large document collections

Abstract
A vast amount of unstructured text data, such as scientific publications, commercial reports, and webpages, must be quickly categorized into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods struggle with large document collections and high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections that performs clustering in three stages: 1) using a permutation test to identify informative topic words and thereby reduce the feature dimension; 2) selecting a small number of the most typical documents to perform initial clustering; and 3) refining the clustering on all documents. The algorithm was tested on the 20 Newsgroups data set, and experimental results showed that, compared with methods that cluster the corpus directly on all document samples and full features, this approach reduced the time cost by an order of magnitude while slightly improving clustering quality.
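
The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the test statistic, the definition of a "typical" document, or the clustering method, so everything below is an assumption made for illustration. Here the permutation test shuffles tokens across documents and uses the variance of a word's per-document count as the statistic, "typical" documents are those with the most informative tokens, and clustering is plain k-means with farthest-point initialization. All function names and the toy corpus are hypothetical.

```python
import random
from collections import Counter

def count_variance(docs, vocab):
    """Variance of each word's count across documents (assumed test statistic)."""
    counts = [Counter(d) for d in docs]
    stats = {}
    for w in vocab:
        xs = [c[w] for c in counts]
        m = sum(xs) / len(xs)
        stats[w] = sum((x - m) ** 2 for x in xs) / len(xs)
    return stats

def informative_words(docs, n_perm=200, alpha=0.05, seed=0):
    """Stage 1 (assumed form): keep words whose counts are more uneven than chance."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    observed = count_variance(docs, vocab)
    pool = [w for d in docs for w in d]
    exceed = dict.fromkeys(vocab, 0)
    for _ in range(n_perm):
        rng.shuffle(pool)  # permute tokens across documents, keeping document lengths
        shuffled, start = [], 0
        for d in docs:
            shuffled.append(pool[start:start + len(d)])
            start += len(d)
        null = count_variance(shuffled, vocab)
        for w in vocab:
            exceed[w] += null[w] >= observed[w]
    return [w for w in vocab if (exceed[w] + 1) / (n_perm + 1) <= alpha]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(v, centroids):
    return min(range(len(centroids)), key=lambda j: sqdist(v, centroids[j]))

def update(vectors, labels, centroids):
    """Recompute each centroid as the mean of its members (empty clusters unchanged)."""
    for j in range(len(centroids)):
        members = [v for v, l in zip(vectors, labels) if l == j]
        if members:
            centroids[j] = [sum(col) / len(members) for col in zip(*members)]

def kmeans(vectors, k, iters=10):
    centroids = [vectors[0]]
    while len(centroids) < k:  # farthest-point initialization
        centroids.append(max(vectors, key=lambda v: min(sqdist(v, c) for c in centroids)))
    for _ in range(iters):
        update(vectors, [nearest(v, centroids) for v in vectors], centroids)
    return centroids

def three_stage_cluster(docs, k, n_typical, seed=0):
    words = informative_words(docs, seed=seed)                     # stage 1
    vectors = [[Counter(d)[w] for w in words] for d in docs]
    typical = sorted(vectors, key=sum, reverse=True)[:n_typical]   # stage 2 (assumed criterion)
    centroids = kmeans(typical, k)
    labels = [nearest(v, centroids) for v in vectors]              # stage 3: one refinement pass
    update(vectors, labels, centroids)
    return words, [nearest(v, centroids) for v in vectors]

# Toy corpus (hypothetical): two topics, interleaved, with shared filler words.
doc_a = ["gpu"] * 4 + ["kernel"] * 4 + ["the", "of"] * 2
doc_b = ["goal"] * 4 + ["match"] * 4 + ["the", "of"] * 2
docs = [doc_a, doc_b] * 6
words, labels = three_stage_cluster(docs, k=2, n_typical=4)
print(words)
print(labels)
```

In this sketch the filler words appear equally often in every document, so their observed variance is zero and they never pass the test, while the topic words are concentrated in half of the documents and do. Only the reduced vectors of the few selected documents enter the iterative k-means loop; the full collection is touched in a single assignment-and-update pass, which is where a speed-up of the kind the abstract reports would come from.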

Year: 2005
DOI: 10.1007/11527503_29
Venue: ADMA
Keywords: typical document, large document collection, efficient clustering approach, unstructured text data, newsgroup data, high-dimensional text feature, cluster corpus, efficient clustering algorithm, document sample, initial clustering, clustering quality, permutation test
Field: Fuzzy clustering, Data mining, CURE data clustering algorithm, Document clustering, Computer science, Artificial intelligence, Cluster analysis, Canopy clustering algorithm, Clustering high-dimensional data, Data stream clustering, Information retrieval, Brown clustering, Machine learning
DocType: Conference
Volume: 3584
ISSN: 0302-9743
ISBN: 3-540-27894-X
Citations: 0
PageRank: 0.34
References: 8
Authors: 3

Authors (3 rows)

Name	Order	Citations	PageRank
Bo Han	1	6	1.53
Lishan Kang	2	775	91.11
Huazhu Song	3	17	6.88
