Title
Efficiently Handling Skew in Outer Joins on Distributed Systems
Abstract
Outer joins are ubiquitous in databases and big data systems. The question of how best to execute outer joins in large parallel systems is particularly challenging as real world datasets are characterized by data skew leading to performance issues. Although skew handling techniques have been extensively studied for inner joins, there is little published work solving the corresponding problem for parallel outer joins. Conventional approaches to this problem such as ones based on hash redistribution often lead to load balancing problems while duplication-based approaches incurs significant overhead in terms of network communication. In this paper, we propose a new algorithm, query with counters (QC), for directly handling skew in outer joins on distributed architectures. We present an efficient implementation of our approach based on the asynchronous partitioned global address space (APGAS) parallel programming model. We evaluate the performance of our approach on a cluster of 192 cores (16 nodes) and datasets of 1 billion tuples with different skew. Experimental results show that our method is scalable and, in cases of high skew, faster than the state-of-the-art.
Year
DOI
Venue
2014
10.1109/CCGrid.2014.35
CCGrid
Keywords
Field
DocType
asynchronous partitioned global address space,x10,query with counters,parallel programming model,distributed systems,distributed join,parallel programming,network communication,load balancing problems,parallel join,apgas,parallel systems,resource allocation,distributed architectures,outer join,big data systems,hash redistribution,performance issues,qc,parallel outer joins,data skew,data handling,performance evaluation,databases,skew handling,duplication-based approaches,silicon,distributed databases,scalability,histograms,radiation detectors
Joins,Tuple,Computer science,Load balancing (computing),Parallel computing,Parallel programming model,Hash function,Skew,Distributed database,Partitioned global address space,Distributed computing
Conference
ISSN
Citations 
PageRank 
2376-4414
7
0.50
References 
Authors
13
4
Name
Order
Citations
PageRank
Long Cheng19116.99
Spyros Kotoulas259046.46
Tomas E. Ward310419.10
Georgios Theodoropoulos433231.39