Title
Load-balancing distributed outer joins through operator decomposition
Abstract
High-performance data analytics relies largely on the efficient execution of distributed data operators such as distributed joins. A large number of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and little published work provides detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over input data that does not exhibit significant skew, and an inner join over data that does. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation on a distributed memory platform demonstrates that the proposed method improves outer join performance under varying data skew and exhibits excellent load-balancing properties compared to current approaches.
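The decomposition described in the abstract can be illustrated with a minimal single-process sketch. This is not the paper's implementation. The function names, the `skew_threshold` parameter, the exact key counting (a distributed system would more likely sample), and the rule that a key counts as "hot" only if it is frequent in the left input and also present in the right input are all assumptions made for illustration. The sketch only shows why the split preserves the result: every left tuple with a hot key is guaranteed a match, so an inner join suffices for that part.

```python
from collections import Counter, defaultdict


def build_index(s):
    """Group the right-side values by join key."""
    index = defaultdict(list)
    for key, s_val in s:
        index[key].append(s_val)
    return index


def left_outer_join(r, s):
    """Plain hash-based left outer join of (key, value) tuples."""
    index = build_index(s)
    out = []
    for key, r_val in r:
        matches = index.get(key)
        if matches:
            out.extend((key, r_val, s_val) for s_val in matches)
        else:
            out.append((key, r_val, None))  # unmatched left tuple
    return out


def inner_join(r, s):
    """Plain hash-based inner join of (key, value) tuples."""
    index = build_index(s)
    return [(key, r_val, s_val)
            for key, r_val in r
            for s_val in index.get(key, [])]


def decomposed_left_outer_join(r, s, skew_threshold):
    """Split the join into an outer part (non-hot keys) and an inner part (hot keys).

    Hypothetical rule: a key is hot if it occurs at least `skew_threshold`
    times in r and also occurs in s.
    """
    counts = Counter(key for key, _ in r)
    s_keys = {key for key, _ in s}
    hot = {k for k, c in counts.items() if c >= skew_threshold and k in s_keys}

    r_hot = [t for t in r if t[0] in hot]
    r_rest = [t for t in r if t[0] not in hot]
    s_hot = [t for t in s if t[0] in hot]

    # Outer join over the non-skewed part, inner join over the skewed part.
    return left_outer_join(r_rest, s) + inner_join(r_hot, s_hot)
```

In a distributed setting, the two parts could be executed with different strategies, for example by replicating the small hot portion of the right input instead of hash-redistributing the skewed left tuples. That choice is an assumption here, not a statement about how POPI is implemented.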
Year
2019
DOI
10.1016/j.jpdc.2019.05.008
Venue
Journal of Parallel and Distributed Computing
Keywords
Distributed join, Outer join, Data skew, Load balancing, Spark
Field
Joins, Data analysis, Computer science, Load balancing (computing), Parallel computing, Distributed memory, Implementation, Operator (computer programming), Skew, Distributed computing
DocType
Journal
Volume
132
ISSN
0743-7315
Citations
0
PageRank
0.34
References
0
Authors
4
Name, Order, Citations, PageRank
Long Cheng, 1, 91, 16.99
Spyros Kotoulas, 2, 590, 46.46
Qingzhi Liu, 3, 1, 1.70
Ying Wang, 4, 276, 55.61