Abstract |
---|
We introduce Gandiva, a new cluster scheduling framework that utilizes domain-specific knowledge to improve latency and efficiency of training deep learning models in a GPU cluster. One key characteristic of deep learning is feedback-driven exploration, where a user often runs a set of jobs (or a multi-job) to achieve the best result for a specific mission and uses early feedback on accuracy to dynamically prioritize or kill a subset of jobs; simultaneous early feedback on the entire multi-job is critical. A second characteristic is the heterogeneity of deep learning jobs in terms of resource usage, making it hard to achieve a best fit a priori. Gandiva addresses these two challenges by exploiting a third key characteristic of deep learning: intra-job predictability, as jobs perform numerous repetitive iterations called mini-batch iterations. Gandiva exploits intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. This predictability is also used for introspecting job performance and dynamically migrating jobs to better-fit GPUs, thereby improving cluster efficiency. We show via a prototype implementation and micro-benchmarks that Gandiva can speed up hyper-parameter searches during deep learning by up to an order of magnitude, and achieves better utilization by transparently migrating and time-slicing jobs to achieve better job-to-resource fit. We also show that, in a real workload of jobs running in a 180-GPU cluster, Gandiva improves aggregate cluster utilization by 26%, pointing to a new way of managing large GPU clusters for deep learning. |
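The abstract's core mechanism — using intra-job predictability to time-slice a GPU across jobs, suspending each job at a mini-batch boundary where its GPU state is smallest — can be sketched as below. This is a minimal illustrative model, not Gandiva's actual implementation or API; the `Job` class, quantum value, and round-robin policy are all assumptions for exposition.

```python
import itertools


class Job:
    """Illustrative DL training job; step() runs one mini-batch iteration.

    minibatch_time models the key property the abstract relies on:
    iteration time is repetitive and hence predictable.
    """

    def __init__(self, name, minibatch_time):
        self.name = name
        self.minibatch_time = minibatch_time  # seconds per mini-batch
        self.iterations_done = 0

    def step(self):
        self.iterations_done += 1
        return self.minibatch_time


def time_slice(jobs, quantum, total_time):
    """Round-robin GPU time-slicing: each job runs whole mini-batches
    until its quantum is exhausted, then suspends at an iteration
    boundary (where the GPU memory footprint is minimal, making
    suspend/resume cheap). Returns the suspension log as
    (job name, cumulative iterations) pairs."""
    clock = 0.0
    schedule = []
    for job in itertools.cycle(jobs):
        if clock >= total_time:
            break
        used = 0.0
        # Run whole mini-batches; never preempt mid-iteration.
        while used < quantum and clock < total_time:
            t = job.step()
            used += t
            clock += t
        schedule.append((job.name, job.iterations_done))
    return schedule


jobs = [Job("a", 0.5), Job("b", 1.0)]
log = time_slice(jobs, quantum=1.0, total_time=4.0)
# Both jobs make simultaneous early progress instead of "a" waiting
# for "b" to finish — the multi-job early-feedback property.
```

Suspending only at iteration boundaries is what makes time-slicing affordable here: between mini-batches a job's transient GPU buffers are released, so the state to save is small and the context-switch cost stays low relative to a mini-batch.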
Year | Venue | Field |
---|---|---|
2018 | OSDI | Predictability, GPU cluster, Scheduling (computing), Workload, Computer science, A priori and a posteriori, Exploit, Artificial intelligence, Deep learning, Distributed computing, Speedup |

DocType | Citations | PageRank |
---|---|---|
Conference | 6 | 0.40 |

References | Authors |
---|---|
0 | 12 |

Name | Order | Citations | PageRank |
---|---|---|---|
Wencong Xiao | 1 | 83 | 6.46 |
Romil Bhardwaj | 2 | 12 | 2.21 |
R. Ramjee | 3 | 3180 | 299.73 |
Muthian Sivathanu | 4 | 300 | 17.82 |
Nipun Kwatra | 5 | 13 | 2.22 |
Zhenhua Han | 6 | 9 | 1.49 |
Pratyush Patel | 7 | 6 | 0.40 |
Xuan Peng | 8 | 9 | 0.81 |
Hanyu Zhao | 9 | 9 | 1.49 |
Quanlu Zhang | 10 | 25 | 5.14 |
Fan Yang | 11 | 127 | 7.10 |
Lidong Zhou | 12 | 2136 | 147.82 |