Title
Smart Shuffling in MapReduce: A Solution to Balance Network Traffic and Workloads
Abstract
In the context of Hadoop, recent studies show that the shuffle operation accounts for as much as a third of the completion time of a MapReduce job. Consequently, the shuffle phase constitutes a crucial aspect of the scheduling of such jobs. During a shuffle phase, the job scheduler assigns reduce tasks to a set of reduce nodes. This may require multiple intermediate data items which share a key to be relocated to this new set of reduce nodes. In turn, this could cause a large volume of simultaneous data relocations within the network. Intuitively, a reduce task experiences shorter access latency if its required items are available locally or in close proximity. This, however, may also result in a hotspot in the network due to imbalanced traffic, as well as the imbalance of the workload on different nodes, regardless of their homogeneity. In this paper, we study data relocation incurred during the shuffle stage in the MapReduce framework. Within an arbitrary network, we aim at a) minimizing the overall network traffic, b) achieving workload balancing, and c) eliminating network hotspots, in order to improve the overall performance. Our contribution consists of the development of a scheduler that satisfies these three goals. We then present an in-depth simulation. Our results show that, for arbitrary network topologies, our Smart Shuffling Scheduler systematically outperforms the CoGRS scheduler in terms of hotspot elimination as well as reduce task load balancing, while ensuring traffic caused by data relocation is low. Not only does our algorithm handle any topology but also its benefits are inversely proportional to the inter-node connectivity of the network topology: the lower this connectivity, the better our algorithm. In particular, for the tree topology commonly used within data centres, our proposed scheduler offers significant improvements over the CoGRS scheduler.
Year
DOI
Venue
2015
10.1109/UCC.2015.18
2015 IEEE/ACM 8th International Conference on Utility and Cloud Computing (UCC)
Keywords
Field
DocType
Hadoop,Mapreduce,Network Traffic,Network Workload Balance
Computer science,Scheduling (computing),Load balancing (computing),Network scheduler,Network simulation,Network topology,Shuffling,Job scheduler,Network traffic control,Distributed computing
Conference
ISSN
Citations 
PageRank 
2373-6860
1
0.37
References 
Authors
15
6
Name
Order
Citations
PageRank
Wei Shi116422.61
Yang Wang218845.73
Jean-pierre Corriveau37919.46
Boqiang Niu410.37
William Lee Croft510.71
Mengfei Peng630.76