Title
Distributed Subdata Selection For Big Data Via Sampling-Based Approach
Abstract
With the development of modern technologies, it is possible to gather an extraordinarily large number of observations. Due to the storage or transmission burden, big data are usually scattered across multiple locations, and it is difficult to transfer all of the data to a central server for analysis. A distributed subdata selection method for the big data linear regression model is proposed. In particular, a two-step subsampling strategy with optimal subsampling probabilities and optimal allocation sizes is developed. The subsample-based estimator effectively approximates the ordinary least squares estimator from the full data. The convergence rate and asymptotic normality of the proposed estimator are established. Simulation studies and an illustrative example using airline data are provided to assess the performance of the proposed method. (C) 2020 Elsevier B.V. All rights reserved.
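The abstract does not give the formulas for the subsampling probabilities or the allocation sizes, so the following is only a minimal sketch of the general idea of two-step distributed subsampling for linear regression, not the authors' method. It assumes a uniform pilot subsample in step one, scores of the form |residual| times the norm of the covariate vector in step two, and allocation sizes proportional to each block's total score. The function names `weighted_ls` and `distributed_subsample_ols` and the parameters `r0` and `r` are hypothetical.

```python
import numpy as np


def weighted_ls(X, y, w):
    """Solve min_b sum_i w_i * (y_i - x_i' b)^2 via the normal equations."""
    Xw = X * w[:, None]
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


def distributed_subsample_ols(blocks, r0, r, rng):
    """Sketch of a two-step distributed subsampling estimator.

    blocks: list of (X_k, y_k) pairs, one per storage location.
    r0:     pilot subsample size drawn uniformly from each block.
    r:      total second-step subsample size, split across blocks.
    """
    # Step 1: uniform pilot subsample per block, pooled into a pilot estimate.
    pilot_X, pilot_y = [], []
    for X, y in blocks:
        idx = rng.choice(len(y), size=r0, replace=True)
        pilot_X.append(X[idx])
        pilot_y.append(y[idx])
    beta0 = np.linalg.lstsq(np.vstack(pilot_X), np.concatenate(pilot_y), rcond=None)[0]

    # Step 2 (assumed scoring rule): score_i = |pilot residual_i| * ||x_i||.
    scores = [np.abs(y - X @ beta0) * np.linalg.norm(X, axis=1) for X, y in blocks]
    totals = np.array([s.sum() for s in scores])

    # Assumed allocation rule: block sizes proportional to total block score.
    sizes = np.maximum(1, np.round(r * totals / totals.sum()).astype(int))

    sub_X, sub_y, sub_w = [], [], []
    for (X, y), s, r_k in zip(blocks, scores, sizes):
        p_local = s / s.sum()  # within-block sampling probabilities
        idx = rng.choice(len(y), size=r_k, replace=True, p=p_local)
        sub_X.append(X[idx])
        sub_y.append(y[idx])
        # Inverse-probability weights: 1 / (expected number of draws of row i).
        sub_w.append(1.0 / (r_k * p_local[idx]))

    # Only the selected rows and their weights need to reach the central server.
    return weighted_ls(np.vstack(sub_X), np.concatenate(sub_y), np.concatenate(sub_w))
```

In this sketch each location only ships its pilot rows, its total score, and its selected rows with weights, which is the point of distributed subsampling: the full data never has to be moved. The inverse-probability weights make the weighted normal equations unbiased estimates of their full-data counterparts.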
Year: 2021
DOI: 10.1016/j.csda.2020.107072
Venue: COMPUTATIONAL STATISTICS & DATA ANALYSIS
Keywords: Allocation sizes, Big data, Distributed subsampling, Optimal subsampling, Regression diagnostic
DocType: Journal
Volume: 153
ISSN: 0167-9473
Citations: 1
PageRank: 0.35
References: 0
Authors: 2
Name, Order, Citations, PageRank
Haixiang Zhang16412.19
haiying243.72