Abstract | ||
---|---|---|
In many application areas, the data being generated and processed exceeds the petabyte scale. Analyzing such an increasingly massive volume of data poses computational as well as statistical challenges. To address these challenges, distributed and parallel processing frameworks have been used to implement scalable data analysis algorithms. Nevertheless, processing an entire big data set at once may exceed the available computing resources and the time requirements of some applications. Thus, approximate approaches can be used to achieve asymptotic analysis results, especially when data analysis algorithms are amenable to an approximate result rather than an exact one. However, most approximation approaches require taking a random sample of the data, which is a nontrivial task when working with big data sets. In this paper, we employ ensemble learning as an approach for asymptotic analysis using randomly selected subsets (i.e., data blocks) of a big data set. We propose an asymptotic ensemble learning framework that depends on block-based sampling rather than record-based sampling. To demonstrate the feasibility and performance of this framework, we present an empirical analysis on real data sets. In addition to the scalability advantage, the experimental results show that several blocks of a data set are enough to obtain approximately the same results as those from using the whole data set. |
Year | DOI | Venue |
---|---|---|
2016 | 10.1145/3006299.3006306 | Proceedings of the 3rd IEEE/ACM International Conference on Big Data Computing, Applications and Technologies |
Keywords | Field | DocType |
Big Data, Distributed and Parallel Processing, Ensemble Learning, Randomness, Asymptotic Analysis | Data modeling, Data mining, Data set, Petabyte, Computer science, Theoretical computer science, Artificial intelligence, Asymptotic analysis, Ensemble learning, Sampling (statistics), Big data, Machine learning, Scalability | Conference |
ISBN | Citations | PageRank |
978-1-5090-4468-9 | 3 | 0.44 |
References | Authors | |
12 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Salman Salloum | 1 | 15 | 1.72 |
Joshua Zhexue Huang | 2 | 1365 | 82.64 |
Yu-Lin He | 3 | 90 | 6.31 |
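
The block-based sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic two-class data, the nearest-centroid base learner, the block count, and every function name below are assumptions chosen only to keep the example self-contained. It shuffles a data set once, cuts it into blocks, trains one base model on each of a few randomly selected blocks, and compares the majority-vote ensemble with a model trained on the whole data set.

```python
import random
from collections import Counter
from statistics import mean


def make_blocks(records, num_blocks, rng):
    """Shuffle once, then deal records into blocks so each block is a random subset."""
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def fit_centroids(block):
    """Toy base learner (an assumption): per-class mean of one numeric feature."""
    by_label = {}
    for x, y in block:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict_one(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def ensemble_predict(models, x):
    """Majority vote over the base models."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]


def accuracy(models, data):
    return sum(ensemble_predict(models, x) == y for x, y in data) / len(data)


def sample_class(rng, center, label, n):
    """Hypothetical synthetic data: one Gaussian feature per class."""
    return [(rng.gauss(center, 1.0), label) for _ in range(n)]


rng = random.Random(0)
train = sample_class(rng, 0.0, 0, 5000) + sample_class(rng, 3.0, 1, 5000)
test = sample_class(rng, 0.0, 0, 1000) + sample_class(rng, 3.0, 1, 1000)

blocks = make_blocks(train, 50, rng)        # 50 blocks of 200 records each
selected = rng.sample(blocks, 5)            # block-level, not record-level, sampling
block_models = [fit_centroids(b) for b in selected]
full_model = [fit_centroids(train)]         # baseline trained on all records

acc_blocks = accuracy(block_models, test)
acc_full = accuracy(full_model, test)
print(f"ensemble on 5 of 50 blocks: {acc_blocks:.3f}")
print(f"single model on all data:   {acc_full:.3f}")
```

In this toy setting the ensemble reads only a tenth of the records. Whether a few blocks suffice on real data depends on how the blocks were formed and on the base learner, which is the question the paper studies empirically.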