Title
A Hybrid Scheduling Algorithm for Data Intensive Workloads in a MapReduce Environment
Abstract
The choice of task scheduler for Hadoop MapReduce applications can have a dramatic effect on job latency. The Hadoop Fair Scheduler (FairS) assigns resources to jobs such that all jobs get, on average, an equal share of resources over time. It thereby addresses the problem with a FIFO scheduler, in which short jobs must wait for long-running jobs to complete. We show that even under FairS, jobs are still forced to wait significantly when the MapReduce system shares resources equally, because of dependencies between the Map, Shuffle, Sort, and Reduce phases. We propose a Hybrid Scheduler (HybS) algorithm based on dynamic priority to reduce latency for variable-length concurrent jobs while maintaining data locality. The dynamic priorities can accommodate multiple task lengths, job sizes, and job waiting times by applying a greedy fractional knapsack algorithm to the assignment of job tasks to processors. The estimated runtimes of Map and Reduce tasks are supplied to the HybS dynamic priorities from historical Hadoop log files. In addition to dynamic priority, we reorder task-to-processor assignments to account for data availability, which automatically maintains the benefits of data locality in this environment. We evaluate our approach by running concurrent workloads consisting of the Word-count and Terasort benchmarks and a satellite scientific data processing workload, and by developing a simulator. Our evaluation shows that HybS improves the average response time for the workloads by approximately 2.1x over the Hadoop FairS, with a standard deviation of 1.4x.
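The greedy fractional knapsack idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the HybS priority formula, so the `priority` function below is a hypothetical stand-in (longer waits and shorter estimated task runtimes rank higher), and the `Job` fields and `assign_slots` name are assumptions made only for this illustration. Data-locality reordering is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    wait_time: float         # seconds the job has spent waiting (assumed input)
    est_task_runtime: float  # estimated seconds per task, e.g. taken from past logs
    pending_tasks: int       # number of tasks still waiting for a slot


def priority(job: Job) -> float:
    """Hypothetical dynamic priority, NOT the formula from the paper:
    grows with waiting time, shrinks with estimated task runtime."""
    return (1.0 + job.wait_time) / job.est_task_runtime


def assign_slots(jobs: list[Job], free_slots: int) -> dict[str, int]:
    """Greedy fractional-knapsack style assignment.

    Jobs are served in descending priority (the value density); each job
    takes as many slots as it can use, and the last job served may receive
    only a fraction of the slots it asked for.
    """
    allocation: dict[str, int] = {}
    remaining = free_slots
    for job in sorted(jobs, key=priority, reverse=True):
        if remaining == 0:
            break
        take = min(job.pending_tasks, remaining)
        allocation[job.name] = take
        remaining -= take
    return allocation


if __name__ == "__main__":
    queue = [
        Job("long", wait_time=10, est_task_runtime=100, pending_tasks=8),
        Job("short", wait_time=10, est_task_runtime=5, pending_tasks=3),
        Job("mid", wait_time=60, est_task_runtime=20, pending_tasks=4),
    ]
    print(assign_slots(queue, free_slots=6))
```

In this toy queue the job that has waited longest is served in full, the short job receives the leftover slots, and the long-running job gets nothing in this round. Because priorities are recomputed from waiting time on each scheduling round, a starved job's priority would rise over time under this assumed formula.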
Year
2012
DOI
10.1109/UCC.2012.32
Venue
UCC
Keywords
data intensive workloads, data locality, hadoop fairs, hadoop fair scheduler, hybrid scheduling algorithm, historical hadoop log file, data availability, mapreduce environment, hybs dynamic priority, concurrent job, hadoop mapreduce application, dynamic priority, reduce task, workflow, scheduling, greedy algorithms, data handling
Field
Fixed-priority pre-emptive scheduling, Workload, Computer science, Scheduling (computing), Parallel computing, Algorithm, Greedy algorithm, Job scheduler, Knapsack problem, Job queue, Hybrid Scheduling, Distributed computing
DocType
Conference
ISSN
2373-6860
Citations
13
PageRank
0.57
References
12
Authors
5
Name, Order, Citations, PageRank
Phuong Nguyen, 1, 51, 6.56
Tyler Simon, 2, 45, 7.29
Milton Halem, 3, 86, 29.78
David Chapman, 4, 13, 0.90
Quang Le, 5, 13, 0.90