Title
Virtual Shuffling for Efficient Data Movement in MapReduce
Abstract
MapReduce is a popular parallel processing framework for large-scale data analytics. To keep up with the increasing volume of datasets, it requires efficient I/O capability from the underlying computer systems to process and analyze data in two phases (mapping and reducing). Between these phases, MapReduce requires a shuffling phase to globally exchange the intermediate data generated by the mapping phase. We show that data shuffling, by physically moving segments of intermediate data across disks, causes significant I/O contention and compounds the I/O problem. In this paper, we propose a novel virtual shuffling strategy to enable efficient data movement and reduce I/O for MapReduce shuffling, thereby reducing power consumption and conserving energy. Virtual shuffling is realized through a combination of three techniques: a three-level segment table, near-demand merging, and dynamic and balanced merging subtrees. Our experimental results show that virtual shuffling significantly speeds up data movement in MapReduce and achieves faster job execution. In particular, its reduction in disk I/O accesses results in as much as 12% savings in power consumption for MapReduce programs.
Year: 2015
DOI: 10.1109/TC.2013.216
Venue: IEEE Trans. Computers
Keywords: job execution, mapreduce, power aware computing, parallel processing framework, disk i/o access, parallel programming, i/o contention, virtual shuffling phase, large-scale data analytics, tree data structures, global intermediate data exchange, intermediate data segments, mapping phase, reducing phase, data analysis, energy conservation, three-level segment table, data processing, merging, hadoop, computer systems, mapreduce programs, data movement, virtual shuffling, i/o capability, dynamic-balanced merging subtrees, near-demand merging, power consumption reduction, information management, data models, tuning, computational modeling
Field: Data modeling, Information management, Data analysis, Computer science, Parallel computing, Parallel processing, Real-time computing, Shuffling, Power demand, Merge (version control), Power consumption
DocType: Journal
Volume: 64
Issue: 2
ISSN: 0018-9340
Citations: 1
PageRank: 0.34
References: 20
Authors: 4
Name | Order | Citations | PageRank
Weikuan Yu | 1 | 1042 | 77.40
Yandong Wang | 2 | 342 | 18.88
Xinyu Que | 3 | 124 | 11.81
Cong Xu | 4 | 50 | 4.38