Title
A Dynamic Scaling Scheme of Cloud-based DNN Training Clusters
Abstract
The amount of resources available in a cloud changes constantly. However, current distributed DNN training frameworks do not allow dynamic scaling of a training cluster, so a cloud-based training cluster cannot flexibly scale in response to dynamically changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster manages a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors by eavesdropping on gradient exchanges before it actually participates in training. Our evaluation showed that the proposed approach reduced scaling overhead by 13% compared to the conventional checkpoint-restore approach, and it revealed possibilities for further improvement.
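The eavesdropping-based synchronization described in the abstract can be illustrated with a minimal sketch. All names and the simulation setup below are our own assumptions, not details from the paper: a late-joining worker starts from a one-time weight snapshot and replays the gradients it overhears from the running cluster's exchanges, so its replica matches the cluster's current weights before it participates in training.

```python
# Hypothetical sketch of eavesdropping-based weight synchronization.
# Assumed names (sgd_step, overheard, etc.) are illustrative only.
import numpy as np

LR = 0.1
rng = np.random.default_rng(0)

def sgd_step(w, g):
    """One plain SGD update; stands in for the cluster's optimizer step."""
    return w - LR * g

w_cluster = np.zeros(4)

# The cluster trains for a while before a scale-out request arrives.
for _ in range(3):
    w_cluster = sgd_step(w_cluster, rng.normal(size=4))

# The new node receives a one-time weight snapshot, then eavesdrops on the
# gradients the existing workers exchange while training continues.
w_new = w_cluster.copy()
overheard = []
for _ in range(7):
    g = rng.normal(size=4)      # stand-in for a real all-reduced gradient
    overheard.append(g)
    w_cluster = sgd_step(w_cluster, g)

# Catch-up replay: apply every overheard gradient before joining.
for g in overheard:
    w_new = sgd_step(w_new, g)

assert np.allclose(w_new, w_cluster)  # replica is in sync; node may join
```

Because the new node never forces the cluster to pause for a checkpoint-restore cycle, the existing workers keep training while it catches up, which is the source of the overhead reduction the abstract reports.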
Year
DOI
Venue
2020
10.1109/SmartCloud49737.2020.00039
2020 IEEE International Conference on Smart Cloud (SmartCloud)
Keywords
DocType
ISBN
distributed training,training clusters,deep neural network,GPU computing,cloud computing
Conference
978-1-7281-6548-6
Citations 
PageRank 
References 
0
0.34
3
Authors
3
Name
Order
Citations
PageRank
Seungmin Oh	1	0	0.34
Kyeonglok Kim	2	0	0.34
Euiseong Seo	3	0	0.34