Title |
---|
Liquid: Intelligent Resource Estimation and Network-Efficient Scheduling for Deep Learning Jobs on Distributed GPU Clusters |
Abstract |
---|
Deep learning (DL) is becoming increasingly popular in many domains, including computer vision, speech recognition, and self-driving automobiles. GPUs can train DL models efficiently but are expensive, which motivates users to share GPU resources to reduce monetary costs in practice. To ensure efficient sharing among multiple users, it is necessary to develop efficient GPU resource management and scheduling solutions. However, existing ones have several shortcomings. First, they require users to specify job resource requirements, which are usually quite inaccurate and lead to cluster resource underutilization. Second, when scheduling DL jobs, they rarely take cluster network characteristics into consideration, resulting in low job execution performance. To overcome these issues, we propose Liquid, an efficient GPU resource management platform for DL jobs with intelligent resource requirement estimation and scheduling. First, we propose a regression-model-based method for job resource requirement estimation to keep users from over-allocating computing resources. Second, we propose intelligent, cluster-network-efficient scheduling methods in both immediate and batch modes, built on the above resource requirement estimation techniques. Third, we further propose three system-level optimizations: pre-scheduling data transmission, fine-grained GPU sharing, and event-driven communication. Experimental results show that Liquid accelerates job execution speed by 18% on average and shortens the average job completion time (JCT) by 21% compared with cutting-edge solutions. Moreover, the proposed optimization methods are effective in various scenarios. |
Year | DOI | Venue |
---|---|---|
2022 | 10.1109/TPDS.2021.3138825 | IEEE Transactions on Parallel and Distributed Systems |

Keywords | DocType | Volume |
---|---|---|
Job scheduling, resource management, deep learning, GPU clusters | Journal | 33 |

Issue | ISSN | Citations |
---|---|---|
11 | 1045-9219 | 2 |

PageRank | References | Authors |
---|---|---|
0.36 | 8 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Rong Gu | 1 | 110 | 17.77 |
Yuquan Chen | 2 | 2 | 0.36 |
Shuai Liu | 3 | 2 | 0.36 |
Haipeng Dai | 4 | 419 | 55.44 |
Guihai Chen | 5 | 3537 | 317.28 |
Kai Zhang | 6 | 2 | 1.04 |
Yang Che | 7 | 2 | 0.36 |
Yihua Huang | 8 | 167 | 22.07 |