Title
DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training
Abstract
Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e., to “fork” the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in the quest to improve training throughput, a mix of data-parallel, model-parallel, pipeline-parallel, and layer-wise parallel approaches makes the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders-of-magnitude improvements for several classes of DNN models.
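The sharding-and-asynchronous-reconstruction idea described in the abstract can be illustrated with a minimal sketch. The Python/PyTorch snippet below is a hypothetical illustration, not DeepClone's actual implementation: since every data-parallel replica holds an identical copy of the model state, each rank can persist a disjoint shard of that state in a background thread (so training resumes immediately, keeping standby duration low), and a clone later reassembles the full state from all shards. The helper names (shard_state, save_shard_async, reconstruct_state) and the file layout are assumptions made for illustration.

    # Hypothetical sketch of sharded, asynchronous state replication.
    # Not DeepClone's API; assumes a PyTorch-style state_dict.
    from concurrent.futures import ThreadPoolExecutor
    import torch

    def shard_state(state_dict, rank, world_size):
        """Each data-parallel replica keeps a disjoint slice of the
        (identical) model state, so no replica serializes the full model."""
        items = sorted(state_dict.items())  # deterministic order on all ranks
        return {k: v for i, (k, v) in enumerate(items) if i % world_size == rank}

    def save_shard_async(state_dict, rank, world_size, pool, path_prefix):
        """Snapshot this rank's shard, then write it in the background so
        training resumes immediately (minimizing standby duration)."""
        shard = {k: v.detach().clone()
                 for k, v in shard_state(state_dict, rank, world_size).items()}
        return pool.submit(torch.save, shard, f"{path_prefix}.rank{rank}.pt")

    def reconstruct_state(path_prefix, world_size):
        """The clone reassembles the full model state from all shards."""
        full = {}
        for rank in range(world_size):
            full.update(torch.load(f"{path_prefix}.rank{rank}.pt"))
        return full

    # Usage on each of world_size replicas (hypothetical):
    #   pool = ThreadPoolExecutor(max_workers=1)
    #   fut = save_shard_async(model.state_dict(), rank, world_size, pool, "/tmp/clone0")
    #   ... training continues while the shard is written ...
    #   fut.result()  # rendezvous before a clone calls reconstruct_state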
Year
2020
DOI
10.1109/CLUSTER49012.2020.00033
Venue
2020 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords
deep learning, data-parallel training, layer-wise parallelism, model cloning, state replication
DocType
Conference
ISSN
1552-5244
ISBN
978-1-7281-6678-0
Citations
1
PageRank
0.36
References
22
Authors
4
Name                Order    Citations    PageRank
Bogdan Nicolae      1        392          29.51
Justin M. Wozniak   2        464          35.32
Matthieu Dorier     3        131          13.91
Franck Cappello     4        3775         251.47