Abstract |
---|
Among the so-called “4Vs” (volume, velocity, variety, and veracity) that characterize the complexity of Big Data, this paper focuses on “Volume” in order to ensure good performance for Extract-Transform-Load (ETL) processes. In this study, we propose a new fine-grained parallelization/distribution approach for populating the Data Warehouse (DW). Unlike prior approaches, which distribute the ETL process only at a coarse-grained level, our approach offers parallelization/distribution at the process, functionality, and elementary-function levels. An ETL process is described in terms of its core functionalities, which can run on a cluster of computers according to the MapReduce (MR) paradigm. The approach thereby allows the ETL process to be distributed at three levels: the “process” level for coarse-grained distribution, and the “functionality” and “elementary-function” levels for fine-grained distribution. Our performance analysis reveals that, with 25 to 38 parallel tasks, the approach speeds up the ETL process by up to 33%, with the improvement rate being linear. |
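The abstract's fine-grained level treats each elementary ETL function as a MapReduce job. As a minimal sketch of that idea (the function and field names below are illustrative, not taken from the paper), one aggregation step can be expressed as a map phase, a shuffle, and a reduce phase:

```python
from collections import defaultdict

# Hypothetical elementary ETL function, expressed in MapReduce style:
# aggregating source records by region before loading into the DW.

def map_phase(record):
    """Map: emit a (key, value) pair from a raw source record."""
    return (record["region"], record["amount"])

def shuffle(pairs):
    """Shuffle: group mapped values by key, as the MR runtime would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate grouped values into one DW-ready row."""
    return {"region": key, "total": sum(values)}

records = [
    {"region": "EU", "amount": 10},
    {"region": "US", "amount": 7},
    {"region": "EU", "amount": 5},
]
rows = [reduce_phase(k, v)
        for k, v in sorted(shuffle(map(map_phase, records)).items())]
print(rows)  # [{'region': 'EU', 'total': 15}, {'region': 'US', 'total': 7}]
```

In an actual MR deployment, the map and reduce calls would run as parallel tasks across the cluster; the sketch only mimics that data flow on a single machine.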
Year | DOI | Venue |
---|---|---|
2017 | 10.1016/j.datak.2017.08.003 | Data & Knowledge Engineering |
Keywords | Field | DocType |
---|---|---|
Data Warehousing, ETL, Parallel and Distributed Processing, Big Data, MapReduce | Data warehouse, Data mining, Computer science, Elementary function, Big data, Database, Speedup | Journal |
Volume | Issue | ISSN |
---|---|---|
111 | 1 | 0169-023X |
Citations | PageRank | References |
---|---|---|
0 | 0.34 | 11 |
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Mahfoud Bala | 1 | 0 | 0.68 |
Omar Boussaid | 2 | 312 | 46.88 |
Z. Alimazighi | 3 | 49 | 18.28 |