Abstract
We present Gandivafair, a distributed, fair-share scheduler that balances the conflicting goals of efficiency and fairness in GPU clusters for deep learning training (DLT). Gandivafair provides performance isolation between users, enabling multiple users to share a single cluster and thus maximizing cluster efficiency. Gandivafair is the first scheduler to allocate cluster-wide GPU time fairly among active users.
Gandivafair achieves efficiency and fairness despite cluster heterogeneity. Data centers host a mix of GPU generations because of the rapid pace at which newer, faster GPUs are released. Because the newer generations face higher demand from users, older GPU generations suffer poor utilization, reducing cluster efficiency. Gandivafair profiles the variable marginal utility that different jobs derive from newer GPUs, and transparently incentivizes users to use older GPUs through a novel resource-trading mechanism that maximizes cluster efficiency without affecting any user's fairness guarantees. With a prototype implementation and evaluation on a heterogeneous 200-GPU cluster, we show that Gandivafair achieves both fairness and efficiency under realistic multi-user workloads.
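As a loose illustration of the trading idea sketched in the abstract (not the paper's actual algorithm; all job names, speedup numbers, and the helper function are hypothetical), a trade of one fast GPU for several slow GPUs benefits both parties whenever the exchange ratio lies strictly between the two jobs' profiled fast-over-slow speedups:

```python
# Hypothetical sketch: a job that gains little from a newer GPU can trade
# its fast-GPU share for proportionally more older GPUs. If the exchange
# ratio (slow GPUs received per fast GPU given up) lies between the two
# jobs' profiled speedups, both jobs gain effective GPU time.

def trade_is_mutually_beneficial(low_speedup: float,
                                 high_speedup: float,
                                 ratio: float) -> bool:
    """Return True if trading at `ratio` slow GPUs per fast GPU helps both
    the low-speedup job (which gives up the fast GPU) and the high-speedup
    job (which receives it)."""
    return low_speedup < ratio < high_speedup

# Example with made-up numbers: job A speeds up only 1.2x on the newer
# GPU, job B speeds up 4x. Trading A's fast GPU to B for 2 slow GPUs
# leaves A with 2/1.2 > 1 units of effective time and costs B only
# 2/4 < 1 units, so both are strictly better off.
print(trade_is_mutually_beneficial(1.2, 4.0, 2.0))   # True
print(trade_is_mutually_beneficial(1.2, 4.0, 5.0))   # False: B overpays
```

The mutually beneficial window exists precisely because the marginal utility of the newer GPU varies across jobs, which is what the profiling step measures.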
Year | DOI | Venue
---|---|---
2020 | 10.1145/3342195.3387555 | EuroSys '20: Fifteenth EuroSys Conference 2020, Heraklion, Greece, April 2020

DocType | ISBN | Citations
---|---|---
Conference | 978-1-4503-6882-7 | 6

PageRank | References | Authors
---|---|---
0.45 | 12 | 5
Name | Order | Citations | PageRank |
---|---|---|---|
Shubham Chaudhary | 1 | 9 | 0.82 |
R. Ramjee | 2 | 3180 | 299.73 |
Muthian Sivathanu | 3 | 300 | 17.82 |
Nipun Kwatra | 4 | 13 | 2.22 |
Srinidhi Viswanatha | 5 | 6 | 0.45 |