Abstract |
---|
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve a best fit a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, as jobs perform numerous repetitive iterations called mini-batch iterations. Gandiva exploits intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. This predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and micro-benchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and achieves better utilization by transparently migrating and time-slicing jobs to achieve better job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning. |
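The abstract's core mechanism — using intra-job predictability to time-slice a GPU across jobs, suspending each job at a mini-batch boundary where its GPU state is smallest — can be sketched as below. This is a minimal illustrative model, not Gandiva's actual implementation or API; the `Job` class, quantum value, and round-robin policy are all assumptions for exposition.

```python
import itertools


class Job:
    """Illustrative DL training job; step() runs one mini-batch iteration.

    minibatch_time models the key property the abstract relies on:
    iteration time is repetitive and hence predictable.
    """

    def __init__(self, name, minibatch_time):
        self.name = name
        self.minibatch_time = minibatch_time  # seconds per mini-batch
        self.iterations_done = 0

    def step(self):
        self.iterations_done += 1
        return self.minibatch_time


def time_slice(jobs, quantum, total_time):
    """Round-robin GPU time-slicing: each job runs whole mini-batches
    until its quantum is exhausted, then suspends at an iteration
    boundary (where the GPU memory footprint is minimal, making
    suspend/resume cheap). Returns the suspension log as
    (job name, cumulative iterations) pairs."""
    clock = 0.0
    schedule = []
    for job in itertools.cycle(jobs):
        if clock >= total_time:
            break
        used = 0.0
        # Run whole mini-batches; never preempt mid-iteration.
        while used < quantum and clock < total_time:
            t = job.step()
            used += t
            clock += t
        schedule.append((job.name, job.iterations_done))
    return schedule


jobs = [Job("a", 0.5), Job("b", 1.0)]
log = time_slice(jobs, quantum=1.0, total_time=4.0)
# Both jobs make simultaneous early progress instead of "a" waiting
# for "b" to finish — the multi-job early-feedback property.
```

Suspending only at iteration boundaries is what makes time-slicing affordable here: between mini-batches a job's transient GPU buffers are released, so the state to save is small and the context-switch cost stays low relative to a mini-batch.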
Year | Venue | Field |
---|---|---|
2018 | OSDI | Predictability, GPU cluster, Scheduling (computing), Workload, Computer science, A priori and a posteriori, Exploit, Artificial intelligence, Deep learning, Distributed computing, Speedup |

DocType | Citations | PageRank |
---|---|---|
Conference | 6 | 0.40 |

References | Authors |
---|---|
0 | 12 |

Name | Order | Citations | PageRank |
---|---|---|---|
Wencong Xiao | 1 | 83 | 6.46 |
Romil Bhardwaj | 2 | 12 | 2.21 |
R. Ramjee | 3 | 3180 | 299.73 |
Muthian Sivathanu | 4 | 300 | 17.82 |
Nipun Kwatra | 5 | 13 | 2.22 |
Zhenhua Han | 6 | 9 | 1.49 |
Pratyush Patel | 7 | 6 | 0.40 |
Xuan Peng | 8 | 9 | 0.81 |
Hanyu Zhao | 9 | 9 | 1.49 |
Quanlu Zhang | 10 | 25 | 5.14 |
Fan Yang | 11 | 127 | 7.10 |
Lidong Zhou | 12 | 2136 | 147.82 |