Title
Equi-join for Multiple Datasets Based on Time Cost Evaluation Model.
Abstract
MapReduce is an important programming model for processing big data with a parallel, distributed algorithm on a cluster. In big data analytic application, equi-join is an important operation. However, it is inefficient to perform equi-join operations in MapReduce when multiple datasets are involved in the join. In this paper, a time cost evaluation model is extended for an equi-join by considering the time cost of calculation. In addition, the sub-joins in an equi-join are classified into star pattern sub-joins on single attribute and chain pattern sub-joins. Based on the extended model, optimization methods are presented and an equi-join plan with lower time cost is chosen for the equi-join. The optimization methods include: the star pattern sub-joins on one attribute are first processed; next, a chain pattern sub-join with minimal scale of intermediate results (i.e. the number of tuples in intermediate results) is processed; at last, a chain pattern sub-join is decomposed into several MapReduce jobs or single MapReduce job by dynamic programming to obtain an optimal scheme for the chain pattern sub-join. We conducted extensive experiments, and the results show that our method is more efficient than those methods such as MDMJ, Hive and Pig.
Year
DOI
Venue
2014
10.1007/978-3-319-11194-0_10
ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2014, PT II
Keywords
Field
DocType
Join,MapReduce,Dynamic Programming
Dynamic programming,Programming paradigm,Tuple,Computer science,Parallel computing,Cost evaluation,Distributed algorithm,Big data
Conference
Volume
ISSN
Citations 
8631
0302-9743
0
PageRank 
References 
Authors
0.34
9
4
Name
Order
Citations
PageRank
Hong Zhu1477.49
Libo Xia200.34
Hong Zhu375.52
Ke Yan431.75