Title
Improving MapReduce performance in heterogeneous environments
Abstract
MapReduce is emerging as an important programming model for large-scale data-parallel applications such as web indexing, data mining, and scientific simulation. Hadoop is an open-source implementation of MapReduce enjoying wide adoption and is often used for short jobs where low response time is critical. Hadoop's performance is closely tied to its task scheduler, which implicitly assumes that cluster nodes are homogeneous and tasks make progress linearly, and uses these assumptions to decide when to speculatively re-execute tasks that appear to be stragglers. In practice, the homogeneity assumptions do not always hold. An especially compelling setting where this occurs is a virtualized data center, such as Amazon's Elastic Compute Cloud (EC2). We show that Hadoop's scheduler can cause severe performance degradation in heterogeneous environments. We design a new scheduling algorithm, Longest Approximate Time to End (LATE), that is highly robust to heterogeneity. LATE can improve Hadoop response times by a factor of 2 in clusters of 200 virtual machines on EC2.
Year
2008
Venue
OSDI
Keywords
data mining,cluster node,hadoop response time,low response time,compelling setting,virtualized data center,elastic compute cloud,heterogeneous environment,longest approximate time,task scheduler,severe performance degradation,improving mapreduce performance,virtual machine,scheduling algorithm,programming model,data center,indexation
Field
Web indexing,Cluster (physics),Virtual machine,Programming paradigm,Computer science,Scheduling (computing),Response time,Real-time computing,Data center,Distributed computing,Cloud computing
DocType
Conference
Citations
733
PageRank
41.66
References
13
Authors
5
Name             Order  Citations  PageRank
Matei Zaharia    1      9101       407.89
Andy Konwinski   2      3400       158.39
D. Joseph        3      5463       492.96
Randy H. Katz    4      16819      3018.89
I. Stoica        5      21406      1710.11