Title
Prediction of the Resource Consumption of Distributed Deep Learning Systems
Abstract
Predicting the resource consumption of distributed deep learning training is of paramount importance: it tells users in advance how long their training will take and helps them manage its cost. Yet no such prediction is available to users, because resource consumption varies significantly with "settings" such as GPU types and with "workloads" such as deep learning models. Previous studies have attempted to derive or model such a prediction, but they fall short of accommodating the many combinations of settings and workloads. This study presents Driple, which uses graph neural networks to predict the resource consumption of diverse workloads. Driple also applies transfer learning so that the graph neural networks adapt to differences in settings. The evaluation results show that Driple predicts resource consumption effectively across a wide range of workloads and settings, and that it reduces the time required to tailor the prediction to a different setting by up to 7.3x.
Year
2022
DOI
10.1145/3530895
Venue
PROCEEDINGS OF THE ACM ON MEASUREMENT AND ANALYSIS OF COMPUTING SYSTEMS
Keywords
distributed deep learning, resource prediction, training time prediction, graph neural networks, transfer learning
DocType
Journal
Volume
6
Issue
2
Citations
1
PageRank
0.41
References
0
Authors
5
Name | Order | Citations | PageRank
Gyeongsik Yang | 1 | 5 | 6.68
Changyong Shin | 2 | 1 | 0.41
Jeunghwan Lee | 3 | 1 | 0.41
Yeonho Yoo | 4 | 1 | 1.09
Chuck Yoo | 5 | 98 | 20.58