Title
Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud
Abstract
To cope with the growing scale of deep neural network (DNN) models and training data, the use of cloud computing for distributed DNN training is becoming increasingly popular. The amount of available resources in a cloud changes continuously according to user demand. Although distributed DNN training jobs run for a long time, ranging from several hours to several days, existing frameworks either cannot scale dynamically or incur high scale-in/out overhead. It is therefore difficult to achieve higher performance by adding graphics processing unit (GPU) nodes to a running training cluster, even when surplus GPU resources become available. In addition, the inability to dynamically reconfigure the training cluster makes it impossible to reform the cluster topology when it was created sub-optimally. This paper proposes a dynamic scaling technique with which workers can be added to and removed from a cluster without suspending the ongoing training job. In addition, we propose a heterogeneity-aware straggler-proof technique so that, even when the performance of the GPUs in the cloud is uneven, adding surplus resources is guaranteed to yield a performance benefit. Compared with the existing checkpoint-based scheme, the proposed scheme improved throughput by up to a factor of 17.52 while scaling out an existing cluster of five workers to ten. Furthermore, while Elastic Horovod, which supports dynamic scaling, stopped training for 841 seconds, the proposed scheme continued training at 95.52% of the maximum performance. Finally, even when GPUs of different performance were mixed, the error between the determined batch size and the optimal batch size was 3.37% on average.
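As a rough illustration of the heterogeneity-aware idea sketched in the abstract, the following Python snippet splits a global batch across workers in proportion to each GPU's measured throughput, so that slower GPUs do not become stragglers in a synchronous ring-allreduce step. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name proportional_batch_sizes and the example throughput figures are hypothetical.

def proportional_batch_sizes(global_batch, throughputs):
    """Split `global_batch` samples across workers proportionally to
    each worker's measured throughput (samples/sec); remainders go to
    the workers with the largest fractional shares so sizes sum exactly."""
    total = sum(throughputs)
    shares = [global_batch * t / total for t in throughputs]
    sizes = [int(s) for s in shares]
    # Hand out leftover samples by descending fractional part.
    leftover = global_batch - sum(sizes)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - sizes[i],
                   reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# Example (hypothetical numbers): a 512-sample global batch over one fast
# and two slow GPUs yields per-worker batches of [256, 128, 128].
print(proportional_batch_sizes(512, [300.0, 150.0, 150.0]))

With such proportional sizing, all workers finish their local computation at roughly the same time, so no single GPU stalls the synchronous gradient exchange.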
Year
2022
DOI
10.1109/ACCESS.2022.3184692
Venue
IEEE ACCESS
Keywords
Training, Graphics processing units, Computational modeling, Cloud computing, Synchronization, Data models, Computer architecture, Distributed training, neural networks, dynamic scaling, heterogeneous cloud, cluster management, ring-allreduce
DocType
Journal
Volume
10
ISSN
2169-3536
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Kyeonglok Kim   1      0          0.34
Hyeonsu Lee     2      0          0.34
Seungmin Oh     3      0          0.34
Euiseong Seo    4      0          0.68