Abstract |
---|
The amount of resources available in a cloud constantly changes. However, current distributed DNN training frameworks do not allow dynamic scaling of a training cluster, so a cloud-based training cluster cannot flexibly scale in response to changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster manages a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors by eavesdropping on gradient exchanges before it actually participates in training. Our evaluation showed that the proposed approach reduced scaling overhead by 13% compared to the conventional checkpoint-restore approach, and it revealed possibilities for further improvement. |
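The abstract describes the mechanism only at a high level. The following is a minimal, self-contained sketch (not the authors' implementation) of the eavesdropping-based synchronization idea: a joining node receives a weight snapshot while training continues, then applies the gradient updates it observes on the wire, converging to the live weights without pausing the cluster. All names (`Worker`, `JoiningNode`, `LR`) and the plain-SGD update rule are assumptions for illustration.

```python
import numpy as np

LR = 0.1  # assumed SGD learning rate, shared by all nodes (hypothetical)

class Worker:
    """An active training node that applies each all-reduced gradient."""
    def __init__(self, weights):
        self.weights = weights.copy()

    def step(self, avg_grad):
        self.weights -= LR * avg_grad

class JoiningNode:
    """A new node that eavesdrops on gradient exchanges before joining."""
    def __init__(self):
        self.weights = None

    def receive_snapshot(self, weights):
        # Weight snapshot arrives while the cluster keeps training.
        self.weights = weights.copy()

    def eavesdrop(self, avg_grad):
        # Replay the same update the workers apply, without contributing.
        self.weights -= LR * avg_grad

rng = np.random.default_rng(0)
worker = Worker(rng.standard_normal(4))
joiner = JoiningNode()

for step in range(10):
    g = rng.standard_normal(4)                   # stand-in for an all-reduced gradient
    if step == 3:
        joiner.receive_snapshot(worker.weights)  # snapshot taken mid-training
    worker.step(g)
    if joiner.weights is not None:
        joiner.eavesdrop(g)                      # apply the observed update

assert np.allclose(worker.weights, joiner.weights)
print("joiner caught up without pausing the cluster")
```

In this toy setting the joiner reaches exact parity with the live workers, which is the point of eavesdropping: unlike checkpoint-restore, the cluster never stops training while the new node catches up.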
Year | DOI | Venue |
---|---|---
2020 | 10.1109/SmartCloud49737.2020.00039 | 2020 IEEE International Conference on Smart Cloud (SmartCloud) |
Keywords | DocType | ISBN
---|---|---
distributed training, training clusters, deep neural network, GPU computing, cloud computing | Conference | 978-1-7281-6548-6
Citations | PageRank | References
---|---|---
0 | 0.34 | 3
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---
Seungmin Oh | 1 | 0 | 0.34 |
Kyeonglok Kim | 2 | 0 | 0.34 |
Euiseong Seo | 3 | 0 | 0.34 |