Title
Adaptive Preshuffling In Hadoop Clusters
Abstract
MapReduce has become an important distributed processing model for large-scale data-intensive applications such as data mining and web indexing. Hadoop, an open-source implementation of MapReduce, is widely used for short jobs requiring low response time. In this paper, we propose a new preshuffling strategy in Hadoop to reduce the high network loads imposed by shuffle-intensive applications. Designing new shuffling strategies is appealing for Hadoop clusters whose network interconnects become a performance bottleneck when the clusters are shared among a large number of applications. The interconnects are especially likely to become a scarce resource when many shuffle-intensive applications share a Hadoop cluster. We implemented the push model along with the preshuffling scheme in the Hadoop system, incorporating a 2-stage pipeline into the preshuffling scheme. Using two Hadoop benchmarks running on a 10-node cluster, we conducted experiments showing that preshuffling-enabled Hadoop clusters are faster than native Hadoop clusters. For example, the push model and the preshuffling scheme powered by the 2-stage pipeline shorten the execution times of the WordCount and Sort Hadoop applications by an average of 10% and 14%, respectively.
Year: 2013
DOI: 10.1016/j.procs.2013.05.422
Venue: 2013 INTERNATIONAL CONFERENCE ON COMPUTATIONAL SCIENCE
Field: Web indexing, Cluster (physics), Bottleneck, Computer science, Parallel computing, sort, Response time, Shuffling
DocType: Conference
Volume: 18
ISSN: 1877-0509
Citations: 1
PageRank: 0.35
References: 11
Authors: 6
Name            Order   Citations   PageRank
Jiong Xie       1       161         10.15
Yun Tian        2       150         9.81
Shu Yin         3       307         22.05
Ji Zhang        4       20          3.75
Xiaojun Ruan    5       390         25.87
Xiao Qin        6       1836        125.69