Title
A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering
Abstract
The scalability of learning algorithms has always been a central concern for data mining researchers, and nowadays, with the rapid increase in data storage capacities and availability, its importance has increased. To this end, sampling has been studied by several researchers in an effort to derive sufficiently accurate models using only small data fractions. In this article we focus on spectral k-means, that is, the k-means approximation as derived by the spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. In the proposed framework we adopt a commonly applied principle in data mining research that considers the use of minimal assumptions concerning the data generating distribution. This restriction imposes several challenges, mainly related to the efficiency of the sequential sampling procedure. These challenges are addressed using elements of matrix perturbation theory and statistics. Moreover, although the main focus is on spectral k-means, we also demonstrate that the proposed framework can be generalized to handle spectral clustering. The proposed sequential sampling framework is consecutively employed for addressing the distributed clustering problem, where the task is to construct a global model for data that resides in distributed network nodes. The main challenge in this context is related to the bandwidth constraints that are commonly imposed, thus requiring that the distributed clustering algorithm consumes a minimal amount of network load. This illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size that can be used for constructing an accurate clustering model that entails the distributional characteristics of the data. As opposed to the relevant distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm is able to derive a statistical estimation of the required relative sizes for all possible values of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economic solution that identifies the crude cluster structure of a dataset and not devote excessive network resources for identifying all the “correct” detailed clusters.
Year
DOI
Venue
2012
10.1145/2297456.2297457
TKDD
Keywords
Field
DocType
accurate clustering model,sequential sampling framework,data storage capacity,data mining research,efficient bootstrap accuracy estimations,proposed framework,small data fraction,data mining researcher,spectral k-means,clustering algorithm,clustering framework,bootstrapping,sampling,clustering
k-medians clustering,Data mining,Fuzzy clustering,Clustering high-dimensional data,CURE data clustering algorithm,Data stream clustering,Correlation clustering,Computer science,Artificial intelligence,Constrained clustering,Cluster analysis,Machine learning
Journal
Volume
Issue
ISSN
6
2
1556-4681
Citations 
PageRank 
References 
2
0.37
36
Authors
2
Name
Order
Citations
PageRank
Dimitrios Mavroeidis11309.50
Panagis Magdalinos2355.55