Title
3Sigma: distribution-based cluster scheduling for runtime uncertainty
Abstract
The 3Sigma cluster scheduling system uses job runtime histories in a new way. Knowing how long each job will execute enables a scheduler to more effectively pack jobs with diverse time concerns (e.g., deadline vs. the-sooner-the-better) and placement preferences on heterogeneous cluster resources. But existing schedulers use single-point estimates (e.g., mean or median of a relevant subset of historical runtimes), and we show that they are fragile in the face of real-world estimate error profiles. In particular, analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles with 8–23% of predictions off by a factor of two or more. Instead of reducing relevant history to a single point, 3Sigma schedules jobs based on full distributions of relevant runtime histories and explicitly creates plans that mitigate the effects of anticipated runtime uncertainty. Experiments with workloads derived from the same traces show that 3Sigma greatly outperforms a state-of-the-art scheduler that uses point estimates from a state-of-the-art predictor; in fact, the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor. 3Sigma reduces SLO miss rate, increases cluster goodput, and improves or matches latency for best-effort jobs.
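The contrast the abstract draws between a single-point estimate and a full runtime distribution can be illustrated with a minimal sketch. This is not 3Sigma's scheduling algorithm; the function names, the sample history, and the 15-minute budget are all hypothetical, chosen only to show why a median can hide a heavy tail that an empirical distribution exposes.

```python
from statistics import median


def point_estimate(history):
    """Single-point estimate: the median of past runtimes."""
    return median(history)


def prob_finish_within(history, budget):
    """Empirical CDF: fraction of past runtimes that fit within the budget."""
    if not history:
        raise ValueError("history must be non-empty")
    return sum(1 for r in history if r <= budget) / len(history)


def off_by_factor_two(estimate, actual):
    """True when the estimate misses the actual runtime by 2x or more."""
    return estimate >= 2 * actual or actual >= 2 * estimate


# Hypothetical runtimes (minutes) of similar past jobs, with a slow tail.
history = [10, 11, 12, 12, 13, 60, 65]
budget = 15  # hypothetical time left before the job's deadline

est = point_estimate(history)
p = prob_finish_within(history, budget)

# The point estimate alone suggests the job fits its budget ...
print(f"median estimate: {est} min, fits budget: {est <= budget}")
# ... while the distribution shows a sizeable chance that it does not.
print(f"empirical P(runtime <= {budget} min) = {p:.2f}")
# A 60-minute run would make the median estimate off by more than 2x.
print(f"estimate off by 2x for a 60 min run: {off_by_factor_two(est, 60)}")
```

In this toy history the median says the job fits, yet two of the seven past runs would have missed the budget by a wide margin; a scheduler that sees the whole distribution can plan around that risk instead of discovering it at runtime.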
Year: 2018
DOI: 10.1145/3190508.3190515
Venue: EuroSys '18: Thirteenth EuroSys Conference 2018, Porto, Portugal, April 2018
Field: Point estimation, Scheduling (computing), Latency (engineering), Computer science, Heterogeneous cluster, Schedule, Scheduling system, Goodput, Distributed computing
DocType: Conference
ISBN: 978-1-4503-5584-1
Citations: 5
PageRank: 0.41
References: 28
Authors (5)
Name                Order  Citations  PageRank
Jun Woo Park        1      169        6.47
Alexey Tumanov      2      554        24.61
Angela Jiang        3      5          0.41
Michael A. Kozuch   4      1782       82.65
G. R. Ganger        5      206        14.55