Abstract | ||
---|---|---|
Virtually every sector of business and industry that uses computing, including financial analysis, search engines, and electronic commerce, incorporates Big Data analysis into its business model. Sophisticated clustering algorithms are highly desired to deduce the nature of data by assigning labels to unlabeled data. We address two main challenges in Big Data. First, by definition, the volume of Big Data is too large to be loaded into a computer's memory (this volume changes based on the computer used or available). Second, in real-time applications, the velocity of new incoming data prevents historical data from being stored and future data from being accessed. Therefore, we propose our Streaming Kernel Fuzzy c-Means (stKFCM) algorithm, which reduces both computational complexity and space complexity significantly. The proposed stKFCM requires only O(n²) memory, where n is the (predetermined) size of a data subset (or data chunk) at each time step, which makes this algorithm truly scalable (as n can be chosen based on the available memory). Furthermore, only 2n² elements of the full N × N (where N >> n) kernel matrix need to be calculated at each time step, thus reducing both the computation time in producing the kernel elements and the complexity of the FCM algorithm. Empirical results show that stKFCM, even with very small n, can provide clustering performance as accurate as kernel fuzzy c-means run on the entire data set while achieving a significant speedup. |
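
To make the memory argument in the abstract concrete, the following is a minimal Python sketch, not the authors' stKFCM implementation. It runs ordinary kernel fuzzy c-means on each chunk independently and counts how many kernel entries are evaluated per time step. The Gaussian kernel, the names `rbf_kernel`, `kernel_fcm`, and `stream_kernel_blocks`, and the choice of pairing each chunk with the previous one to reach 2n² entries are all assumptions made for illustration; the abstract does not say which 2n² entries stKFCM uses, nor how it carries cluster information from one chunk to the next.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian (RBF) kernel block between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def kernel_fcm(K, c, m=2.0, iters=50, seed=0):
    """Plain kernel fuzzy c-means on a single n x n kernel block K.

    Returns an (n, c) membership matrix whose rows sum to 1.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U**m
        W /= W.sum(axis=0, keepdims=True)  # column j holds the weights of center j
        # Squared feature-space distance: K_ii - 2 (K W)_ij + w_j^T K w_j
        center_norm = np.einsum("ij,ik,kj->j", W, K, W)
        d2 = np.diag(K)[:, None] - 2.0 * K @ W + center_norm[None, :]
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U

def stream_kernel_blocks(X, n, sigma=1.0):
    """Yield (K_tt, K_tp) per chunk; K_tp is None for the first chunk.

    Assumption: the second n x n block pairs the current chunk with the
    previous one. Only these two blocks are ever held in memory.
    """
    prev = None
    for start in range(0, len(X), n):
        chunk = X[start:start + n]
        K_tt = rbf_kernel(chunk, chunk, sigma)
        K_tp = rbf_kernel(chunk, prev, sigma) if prev is not None else None
        yield K_tt, K_tp
        prev = chunk

# Synthetic data: two well-separated 2-D blobs, N points in total.
rng = np.random.default_rng(0)
N, n = 1000, 50
X = np.vstack([rng.normal(0.0, 0.5, (N // 2, 2)),
               rng.normal(4.0, 0.5, (N // 2, 2))])
rng.shuffle(X)  # shuffle rows in place so every chunk mixes both blobs

max_elems = 0
for K_tt, K_tp in stream_kernel_blocks(X, n):
    elems = K_tt.size + (K_tp.size if K_tp is not None else 0)
    max_elems = max(max_elems, elems)
    U = kernel_fcm(K_tt, c=2)  # memberships for the current chunk only

print(f"kernel entries per step: {max_elems} (2n^2 = {2 * n * n})")
print(f"full kernel matrix:      {N * N} entries")
```

With these illustrative sizes, each step touches at most 2n² kernel entries, while the full N × N matrix is never formed. Because n is a free parameter, it can be chosen to fit whatever memory is available.
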
Year | DOI | Venue |
---|---|---|
2013 | 10.1109/BigData.2013.6691749 | BigData Conference |
Keywords | Field | DocType |
---|---|---|
streaming data,clustering performance,pattern clustering,stkfcm algorithm,projection,computational complexity reduction,approximation theory,fuzzy c-means,streaming kernel fuzzy c-means algorithm,scalable algorithms,kernel fuzzy c-means scalable approximation,computational complexity,big data analysis,computation time reduction,kernel clustering,big data,space complexity reduction,kernel matrix | Fuzzy clustering,Data mining,CURE data clustering algorithm,Computer science,Tree kernel,Theoretical computer science,Artificial intelligence,String kernel,Cluster analysis,Data stream clustering,Kernel embedding of distributions,Variable kernel density estimation,Machine learning | Conference |
ISSN | Citations | PageRank |
---|---|---|
2639-1589 | 2 | 0.38 |
References | Authors |
---|---|
25 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Zijian Zhang | 1 | 27 | 9.14 |
Timothy C. Havens | 2 | 2 | 0.38 |