A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering - Citegraph

Paper Info

Title
A Sequential Sampling Framework for Spectral k-Means Based on Efficient Bootstrap Accuracy Estimations: Application to Distributed Clustering

Abstract
The scalability of learning algorithms has always been a central concern for data mining researchers, and nowadays, with the rapid increase in data storage capacities and availability, its importance has increased. To this end, sampling has been studied by several researchers in an effort to derive sufficiently accurate models using only small data fractions. In this article we focus on spectral k-means, that is, the k-means approximation as derived by the spectral relaxation, and propose a sequential sampling framework that iteratively enlarges the sample size until the k-means results (objective function and cluster structure) become indistinguishable from the asymptotic (infinite-data) output. In the proposed framework we adopt a commonly applied principle in data mining research that considers the use of minimal assumptions concerning the data generating distribution. This restriction imposes several challenges, mainly related to the efficiency of the sequential sampling procedure. These challenges are addressed using elements of matrix perturbation theory and statistics. Moreover, although the main focus is on spectral k-means, we also demonstrate that the proposed framework can be generalized to handle spectral clustering. The proposed sequential sampling framework is consecutively employed for addressing the distributed clustering problem, where the task is to construct a global model for data that resides in distributed network nodes. The main challenge in this context is related to the bandwidth constraints that are commonly imposed, thus requiring that the distributed clustering algorithm consumes a minimal amount of network load. This illustrates the applicability of the proposed approach, as it enables the determination of a minimal sample size that can be used for constructing an accurate clustering model that entails the distributional characteristics of the data. As opposed to the relevant distributed k-means approaches, our framework takes into account the fact that the choice of the number of clusters has a crucial effect on the required amount of communication. More precisely, the proposed algorithm is able to derive a statistical estimation of the required relative sizes for all possible values of k. This unique feature of our distributed clustering framework enables a network administrator to choose an economic solution that identifies the crude cluster structure of a dataset and not devote excessive network resources for identifying all the “correct” detailed clusters.

Year	DOI	Venue
2012	10.1145/2297456.2297457	TKDD
Keywords	Field	DocType
accurate clustering model,sequential sampling framework,data storage capacity,data mining research,efficient bootstrap accuracy estimations,proposed framework,small data fraction,data mining researcher,spectral k-means,clustering algorithm,clustering framework,bootstrapping,sampling,clustering	k-medians clustering,Data mining,Fuzzy clustering,Clustering high-dimensional data,CURE data clustering algorithm,Data stream clustering,Correlation clustering,Computer science,Artificial intelligence,Constrained clustering,Cluster analysis,Machine learning	Journal
Volume	Issue	ISSN
6	2	1556-4681
Citations	PageRank	References
2	0.37	36
Authors
2

Authors (2 rows)

Cited by (2 rows)

References (36 rows)

Name	Order	Citations	PageRank
Dimitrios Mavroeidis	1	130	9.50
Panagis Magdalinos	2	35	5.55

1