Title
On the Feasibility of Distributed Kernel Regression for Big Data
Abstract
In Big Data applications, massive datasets with huge numbers of observations are frequently encountered. To deal with such massive datasets, a divide-and-conquer scheme (e.g., MapReduce) is often used for the analysis of Big Data. With such a strategy, a large dataset (e.g., a centralized real database or a virtual database with distributed data sources) is first divided into smaller manageable segments; the final output is then aggregated from the individual outputs of the segments. Despite its popularity in practice, it remains largely unknown whether such a distributive strategy provides valid theoretical inferences to the original data. In this paper, we address this fundamental issue for the distributed kernel regression (DKR) problem, where the algorithmic feasibility is measured by the generalization performance of the resulting estimator. To justify DKR, a uniform convergence rate is needed for bounding the generalization error over the individual outputs, which brings new and challenging issues in the Big Data setup. Using a sample dependent kernel dictionary, we show that, with proper data segmentation, DKR leads to an estimator that is generalization consistent to the unknown regression function. This result theoretically justifies DKR and sheds light on more advanced distributive algorithms for processing Big Data. The promising performance of the method is supported by both simulation and real data examples.
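The divide-and-conquer scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: the paper works with a sample-dependent kernel dictionary, whereas this sketch substitutes plain kernel ridge regression with a Gaussian kernel on each segment and aggregates by simple averaging. All names (`fit_local`, `distributed_kernel_regression`), the bandwidth, the regularization parameter `lam`, and the toy data are assumptions made for illustration.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=0.2):
    """Gaussian (RBF) kernel between two scalars (bandwidth is an assumed value)."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_local(xs, ys, lam=1e-3):
    """Fit kernel ridge regression on one segment and return its predictor."""
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam * n  # ridge term: solve (K + lam * n * I) alpha = y
    alpha = solve_linear_system(K, ys)

    def predict(x):
        return sum(a * gaussian_kernel(x, xi) for a, xi in zip(alpha, xs))

    return predict


def distributed_kernel_regression(xs, ys, num_segments):
    """Split the data into segments, fit each locally, average the predictors."""
    indices = list(range(len(xs)))
    random.Random(0).shuffle(indices)  # random segmentation with a fixed seed
    segments = [indices[k::num_segments] for k in range(num_segments)]
    local = [
        fit_local([xs[i] for i in seg], [ys[i] for i in seg])
        for seg in segments
    ]

    def predict(x):
        return sum(f(x) for f in local) / len(local)

    return predict


# Toy data (hypothetical): noisy samples of sin(2*pi*x) on [0, 1].
rng = random.Random(42)
xs = [rng.uniform(0.0, 1.0) for _ in range(200)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.1) for x in xs]

f_bar = distributed_kernel_regression(xs, ys, num_segments=4)
```

Calling `f_bar(x)` returns the average of the four local predictions at `x`. Each segment only requires solving a 50-by-50 linear system rather than a 200-by-200 one, which is the computational motivation for segmenting the data. Whether the averaged estimator remains consistent as the number of segments grows is the question the paper addresses.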
Year
2016
DOI
10.1109/TKDE.2016.2594060
Venue
IEEE Trans. Knowl. Data Eng.
Keywords
Big data, Kernel, Distributed databases, Distributed algorithms, Estimation, Data models, Algorithm design and analysis
DocType
Journal
Volume
28
Issue
11
ISSN
1041-4347
Citations
3
PageRank
0.56
References
21
Authors
3
Name              Order   Citations   PageRank
Chen Xu           1       3           0.90
Yongquan Zhang    2       3           0.56
Runze Li          3       112         20.80