Abstract | ||
---|---|---|
Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining provable and strong guarantees in solution quality? In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of distributed datasets, and compute a set of global kernel principal components with relative error guarantees independent of the dimension of the feature space or the total number of data points. In particular, computing k principal components with relative error ε over s workers has communication cost Õ(spk/ε+sk2/ε3) words, where p is the average number of nonzero entries in each data point. Furthermore, we experimented the algorithm with large-scale real world datasets and showed that the algorithm produces a high quality kernel PCA solution while using significantly less communication than alternative approaches. |
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/2939672.2939796 | KDD |
Keywords | Field | DocType |
Kernel method,Principal Component Analysis,distributed computing | Data mining,Radial basis function kernel,Computer science,Kernel principal component analysis,Tree kernel,Polynomial kernel,Artificial intelligence,String kernel,Pattern recognition,Kernel embedding of distributions,Kernel method,Variable kernel density estimation,Machine learning | Conference |
Citations | PageRank | References |
1 | 0.40 | 6 |
Authors | ||
5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Maria-Florina Balcan | 1 | 1445 | 105.01 |
Yingyu Liang | 2 | 393 | 31.39 |
Le Song | 3 | 2437 | 159.27 |
David P. Woodruff | 4 | 2156 | 142.38 |
Bo Xie 0002 | 5 | 94 | 5.19 |