Abstract |
---|
In the last decade we have witnessed a growing interest in processing large data sets on large-scale distributed clusters. A big part of the complex data analysis pipelines performed by these systems consists of a sequence of relatively simple query operations, such as joining two or more tables, or sorting. This tutorial discusses several recent algorithmic developments for data processing in such large distributed clusters. It uses as a model of computation the Massively Parallel Computation (MPC) model, a simplification of the BSP model, where the only cost is given by the amount of communication and the number of communication rounds. Based on the MPC model, we study and analyze several algorithms for three core data processing tasks: multiway join queries, sorting and matrix multiplication. We discuss the common algorithmic techniques across all tasks, relate the algorithms to what is used in practical systems, and finally present open problems for future research. |
Year | DOI | Venue |
---|---|---|
2018 | 10.1145/3183713.3197388 | SIGMOD/PODS '18: International Conference on Management of Data, Houston, TX, USA, June 2018 |
Keywords | Field | DocType |
Distributed Query Evaluation, Bulk Synchronous Parallel Model | Data mining, Cluster (physics), Data set, Data processing, Computer science, Work in process, Parallel computing, Complex data type, Sorting, Model of computation, Matrix multiplication | Conference |
ISSN | ISBN | Citations |
0730-8078 | 978-1-4503-4703-7 | 0 |
PageRank | References | Authors |
0.34 | 21 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Paraschos Koutris | 1 | 347 | 26.63 |
Semih Salihoglu | 2 | 433 | 24.83 |
Dan Suciu | 3 | 9625 | 1349.54 |