Abstract |
---|
The amount of resources available in a cloud constantly changes. However, current distributed DNN training frameworks do not allow dynamic scaling of a training cluster, so a cloud-based training cluster cannot flexibly scale in response to changing resource availability. To resolve this issue, we propose a dynamic scaling scheme for cloud-based DNN training clusters. In the proposed approach, a cluster manages a separate communication pool for orchestrating scaling operations, and a new node synchronizes its weight tensors by eavesdropping on gradient exchanges before it actually participates in training. Our evaluation showed that the proposed approach reduced scaling overhead by 13% compared to the conventional checkpoint-restore approach, and it revealed possibilities for further improvement. |
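The abstract describes the mechanism only at a high level. The following is a minimal, self-contained sketch (not the authors' implementation) of the eavesdropping-based synchronization idea: a joining node receives a weight snapshot while training continues, then applies the gradient updates it observes on the wire, converging to the live weights without pausing the cluster. All names (`Worker`, `JoiningNode`, `LR`) and the plain-SGD update rule are assumptions for illustration.

```python
import numpy as np

LR = 0.1  # assumed SGD learning rate, shared by all nodes (hypothetical)

class Worker:
    """An active training node that applies each all-reduced gradient."""
    def __init__(self, weights):
        self.weights = weights.copy()

    def step(self, avg_grad):
        self.weights -= LR * avg_grad

class JoiningNode:
    """A new node that eavesdrops on gradient exchanges before joining."""
    def __init__(self):
        self.weights = None

    def receive_snapshot(self, weights):
        # Weight snapshot arrives while the cluster keeps training.
        self.weights = weights.copy()

    def eavesdrop(self, avg_grad):
        # Replay the same update the workers apply, without contributing.
        self.weights -= LR * avg_grad

rng = np.random.default_rng(0)
worker = Worker(rng.standard_normal(4))
joiner = JoiningNode()

for step in range(10):
    g = rng.standard_normal(4)                   # stand-in for an all-reduced gradient
    if step == 3:
        joiner.receive_snapshot(worker.weights)  # snapshot taken mid-training
    worker.step(g)
    if joiner.weights is not None:
        joiner.eavesdrop(g)                      # apply the observed update

assert np.allclose(worker.weights, joiner.weights)
print("joiner caught up without pausing the cluster")
```

In this toy setting the joiner reaches exact parity with the live workers, which is the point of eavesdropping: unlike checkpoint-restore, the cluster never stops training while the new node catches up.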
Year | DOI | Venue |
---|---|---
2020 | 10.1109/SmartCloud49737.2020.00039 | 2020 IEEE International Conference on Smart Cloud (SmartCloud) |
Keywords | DocType | ISBN
---|---|---
distributed training, training clusters, deep neural network, GPU computing, cloud computing | Conference | 978-1-7281-6548-6
Citations | PageRank | References
---|---|---
0 | 0.34 | 3
Authors |
---|
3 |
Name | Order | Citations | PageRank |
---|---|---|---
Seungmin Oh | 1 | 0 | 0.34 |
Kyeonglok Kim | 2 | 0 | 0.34 |
Euiseong Seo | 3 | 0 | 0.34 |