Title
An Uncoupled Data Process and Transfer Model for MapReduce
Abstract
In the original MapReduce model, reduce tasks fetch the output of map tasks by pulling it. However, a reduce task that occupies a reduce slot cannot start executing until all of its corresponding map tasks have completed. This creates a dependence between map and reduce tasks, which this paper calls the coupled relationship. The coupled relationship leads to two problems: reduce slot hoarding and underutilized network bandwidth. In addition, storing result data is costly, especially when the system keeps replicas, which leads to an inefficient storage problem. We propose an uncoupled data process and transfer model to address these problems. Four core techniques (weighted mapping, data pushing, partial data backup, and data compression) are introduced and implemented in Apache Hadoop, the mainstream open-source implementation of the MapReduce model. This work has been deployed at Baidu, the largest search engine company in China. A real-world web data processing application shows that our model improves system throughput by 29.5%, reduces total wall time by 22.8%, provides a weighted wall time acceleration of 26.3%, and reduces the result data stored on disk by 70%. Moreover, the implementation of this model is transparent to users and compatible with the original Hadoop.
Year
2015
DOI
10.1007/978-3-662-46335-2_2
Venue
Lecture Notes in Computer Science
Keywords
MapReduce, Data transfer, Uncoupled model, Compression
Field
Data processing, Search engine, Data transmission, Computer science, Parallel computing, Bandwidth (signal processing), Acceleration, Throughput, Data compression, Backup, Distributed computing
DocType
Journal
Volume
8970
ISSN
0302-9743
Citations
0
PageRank
0.34
References
13
Authors
4
Name         Order   Citations   PageRank
Li Zha       1       0           0.34
Jie Zhang    2       47          15.01
Wei Liu      3       0           0.68
Jian Lin     4       34          8.22