Title
Scale-Train: A Scalable DNN Training Framework for a Heterogeneous GPU Cloud
Abstract
To cope with the growing scale of deep neural network (DNN) models and training data, the use of cloud computing for distributed DNN training is becoming increasingly popular. The amount of available resources in a cloud changes continuously according to user demand. Although distributed DNN training jobs run for a long time, ranging from several hours to several days, existing frameworks either cannot scale dynamically or incur high scale-in/out overhead. It is therefore difficult to achieve higher performance by adding graphics processing unit (GPU) nodes to a running training cluster, even when surplus GPU resources become available. In addition, the inability to dynamically reconfigure the training cluster makes it impossible to reform the cluster topology when it was created sub-optimally. This paper proposes a dynamic scaling technique with which workers can be added to and removed from a cluster without suspending the ongoing training job. In addition, we propose a heterogeneity-aware straggler-proof technique so that, even when the performance of the GPUs in the cloud is uneven, adding surplus resources is guaranteed to yield a performance benefit. Compared with the existing checkpoint-based scheme, the proposed scheme improved throughput by up to a factor of 17.52 while scaling out an existing cluster of five workers to ten. Furthermore, while Elastic Horovod, which supports dynamic scaling, stopped training for 841 seconds, the proposed scheme continued training at 95.52% of the maximum performance. Finally, even when GPUs of different performance were mixed, the error between the determined batch size and the optimal batch size was 3.37% on average.
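As a rough illustration of the heterogeneity-aware idea sketched in the abstract, the following Python snippet splits a global batch across workers in proportion to each GPU's measured throughput, so that slower GPUs do not become stragglers in a synchronous ring-allreduce step. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name proportional_batch_sizes and the example throughput figures are hypothetical.

def proportional_batch_sizes(global_batch, throughputs):
    """Split `global_batch` samples across workers proportionally to
    each worker's measured throughput (samples/sec); remainders go to
    the workers with the largest fractional shares so sizes sum exactly."""
    total = sum(throughputs)
    shares = [global_batch * t / total for t in throughputs]
    sizes = [int(s) for s in shares]
    # Hand out leftover samples by descending fractional part.
    leftover = global_batch - sum(sizes)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - sizes[i],
                   reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

# Example (hypothetical numbers): a 512-sample global batch over one fast
# and two slow GPUs yields per-worker batches of [256, 128, 128].
print(proportional_batch_sizes(512, [300.0, 150.0, 150.0]))

With such proportional sizing, all workers finish their local computation at roughly the same time, so no single GPU stalls the synchronous gradient exchange.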
Year
2022
DOI
10.1109/ACCESS.2022.3184692
Venue
IEEE ACCESS
Keywords
Training, Graphics processing units, Computational modeling, Cloud computing, Synchronization, Data models, Computer architecture, Distributed training, neural networks, dynamic scaling, heterogeneous cloud, cluster management, ring-allreduce
DocType
Journal
Volume
10
ISSN
2169-3536
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Kyeonglok Kim   1      0          0.34
Hyeonsu Lee     2      0          0.34
Seungmin Oh     3      0          0.34
Euiseong Seo    4      0          0.68