Title
On clustering massive text and categorical data streams
Abstract
In this paper, we study the data stream clustering problem in the context of text and categorical data domains. While the clustering problem has recently been studied for numeric data streams, text and categorical data present different challenges because of the large and unordered nature of the corresponding attributes. We therefore propose algorithms for text and categorical data stream clustering. We propose a condensation-based approach for stream clustering which summarizes the stream into a number of fine-grained cluster droplets. These summarized droplets can be used in conjunction with a variety of user queries to construct the clusters for different input parameters. This provides an online analytical processing approach to stream clustering. We also study the problem of detecting noisy and outlier records in real time. We test the approach on a number of real and synthetic data sets, and show the effectiveness of the method over the baseline OSKM algorithm for stream clustering.
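The abstract does not give the internal layout of a cluster droplet, so the following is only a minimal sketch of the general idea for categorical records, under stated assumptions. The `Droplet` class, the frequency-based `similarity` measure, the `assign` helper, and the `threshold` value are all hypothetical names and choices made for illustration, not the authors' definitions. Each droplet keeps additive counts, each incoming record is absorbed by its most similar droplet, and a record that matches no droplet well enough opens a new one, which is one simple way to flag a possible outlier in a single pass.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Droplet:
    """Additive summary of the records absorbed so far (hypothetical layout)."""
    counts: Counter = field(default_factory=Counter)  # (attribute, value) -> frequency
    size: int = 0

    def add(self, record: dict) -> None:
        # Count every (attribute, value) pair of the record once.
        self.counts.update(record.items())
        self.size += 1

    def similarity(self, record: dict) -> float:
        # Assumed measure: average relative frequency, within this droplet,
        # of the record's attribute values.
        if self.size == 0 or not record:
            return 0.0
        hits = sum(self.counts[item] for item in record.items())
        return hits / (self.size * len(record))


def assign(droplets: list, record: dict, threshold: float = 0.5) -> int:
    """Absorb the record into its closest droplet, or open a new one.

    Returns the index of the droplet that received the record.
    """
    best, best_sim = -1, 0.0
    for i, droplet in enumerate(droplets):
        sim = droplet.similarity(record)
        if sim > best_sim:
            best, best_sim = i, sim
    if best == -1 or best_sim < threshold:
        # No sufficiently similar droplet: treat the record as a new cluster seed.
        droplets.append(Droplet())
        best = len(droplets) - 1
    droplets[best].add(record)
    return best
```

Because the summaries are plain counts, droplets can be merged later by adding their counters, which is what would let coarser clusters be built on demand for different query parameters.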
Year: 2010
DOI: 10.1007/s10115-009-0241-z
Venue: Knowl. Inf. Syst.
Keywords: stream clustering, text clustering, text streams, text stream clustering, categorical data, massive text, numeric data stream, categorical data stream clustering, categorical data domain, synthetic data set, online analytical processing approach, different input parameter, data stream, clustering problem, synthetic data, real time
Field: Fuzzy clustering, Data mining, Canopy clustering algorithm, Data stream mining, Clustering high-dimensional data, CURE data clustering algorithm, Data stream clustering, Correlation clustering, Computer science, Cluster analysis
DocType: Journal
Volume: 24
Issue: 2
ISSN: 0219-3116
Citations: 39
PageRank: 1.61
References: 32
Authors
2
Name
Order
Citations
PageRank
Charu C. Aggarwal19081636.68
Philip S. Yu222612.27