Abstract | ||
---|---|---|
MapReduce is an important programming model for processing big data with a parallel, distributed algorithm on a cluster. In big data analytic application, equi-join is an important operation. However, it is inefficient to perform equi-join operations in MapReduce when multiple datasets are involved in the join. In this paper, a time cost evaluation model is extended for an equi-join by considering the time cost of calculation. In addition, the sub-joins in an equi-join are classified into star pattern sub-joins on single attribute and chain pattern sub-joins. Based on the extended model, optimization methods are presented and an equi-join plan with lower time cost is chosen for the equi-join. The optimization methods include: the star pattern sub-joins on one attribute are first processed; next, a chain pattern sub-join with minimal scale of intermediate results (i.e. the number of tuples in intermediate results) is processed; at last, a chain pattern sub-join is decomposed into several MapReduce jobs or single MapReduce job by dynamic programming to obtain an optimal scheme for the chain pattern sub-join. We conducted extensive experiments, and the results show that our method is more efficient than those methods such as MDMJ, Hive and Pig. |
Year | DOI | Venue |
---|---|---|
2014 | 10.1007/978-3-319-11194-0_10 | ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2014, PT II |
Keywords | Field | DocType |
Join,MapReduce,Dynamic Programming | Dynamic programming,Programming paradigm,Tuple,Computer science,Parallel computing,Cost evaluation,Distributed algorithm,Big data | Conference |
Volume | ISSN | Citations |
8631 | 0302-9743 | 0 |
PageRank | References | Authors |
0.34 | 9 | 4 |