Title
Design and evaluation of small–large outer joins in cloud computing environments.
Abstract
Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Although inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work analyzing outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small–large outer join, where one relation is relatively small and the other is large. Conventional implementations for this case, such as those based on hash redistribution, often incur significant network communication, while duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small–large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments on the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-the-art method, the DDR algorithm is shown to be easier to implement and achieves similar or better performance under different outer join workloads, and thus can be considered a new option for current data analysis applications. Moreover, our detailed experimental results provide insights into current small–large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications.
Year
2017
DOI
10.1016/j.jpdc.2017.02.007
Venue
Journal of Parallel and Distributed Computing
Keywords
Parallel joins, Outer joins, Small–large joins, Cloud computing, Performance evaluation
Field
Joins, Data processing, Spark (mathematics), Computer science, Parallel computing, Implementation, Hash function, Analytics, Distributed computing, Cloud computing, Scalability
DocType
Journal
Volume
110
ISSN
0743-7315
Citations
8
PageRank
0.45
References
30
Authors
4
Name                 Order   Citations   PageRank
Long Cheng           1       91          16.99
Ilias Tachmazidis    2       48          11.44
Spyros Kotoulas      3       590         46.46
Grigoris Antoniou    4       2401        190.28