Title
DeepClone: Lightweight State Replication of Deep Learning Models for Data Parallel Training
Abstract
Training modern deep neural network (DNN) models involves complex workflows triggered by model exploration, sensitivity analysis, explainability, etc. A key primitive in this context is the ability to clone a model training instance, i.e., to “fork” the training process in a potentially different direction, which enables comparisons of different evolution paths using variations of training data and model parameters. However, in the quest to improve training throughput, a mix of data-parallel, model-parallel, pipeline-parallel, and layer-wise parallel approaches makes the problem of cloning highly complex. In this paper, we explore the problem of efficient cloning under such circumstances. To this end, we leverage several properties of data-parallel training and layer-wise parallelism to design DeepClone, a cloning approach based on augmenting the execution graph to gain direct access to tensors, which are then sharded and reconstructed asynchronously in order to minimize runtime overhead, standby duration, and readiness duration. Compared with state-of-the-art approaches, DeepClone shows orders-of-magnitude improvements for several classes of DNN models.
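The sharding-and-asynchronous-reconstruction idea described in the abstract can be illustrated with a minimal sketch. The Python/PyTorch snippet below is a hypothetical illustration, not DeepClone's actual implementation: since every data-parallel replica holds an identical copy of the model state, each rank can persist a disjoint shard of that state in a background thread (so training resumes immediately, keeping standby duration low), and a clone later reassembles the full state from all shards. The helper names (shard_state, save_shard_async, reconstruct_state) and the file layout are assumptions made for illustration.

    # Hypothetical sketch of sharded, asynchronous state replication.
    # Not DeepClone's API; assumes a PyTorch-style state_dict.
    from concurrent.futures import ThreadPoolExecutor
    import torch

    def shard_state(state_dict, rank, world_size):
        """Each data-parallel replica keeps a disjoint slice of the
        (identical) model state, so no replica serializes the full model."""
        items = sorted(state_dict.items())  # deterministic order on all ranks
        return {k: v for i, (k, v) in enumerate(items) if i % world_size == rank}

    def save_shard_async(state_dict, rank, world_size, pool, path_prefix):
        """Snapshot this rank's shard, then write it in the background so
        training resumes immediately (minimizing standby duration)."""
        shard = {k: v.detach().clone()
                 for k, v in shard_state(state_dict, rank, world_size).items()}
        return pool.submit(torch.save, shard, f"{path_prefix}.rank{rank}.pt")

    def reconstruct_state(path_prefix, world_size):
        """The clone reassembles the full model state from all shards."""
        full = {}
        for rank in range(world_size):
            full.update(torch.load(f"{path_prefix}.rank{rank}.pt"))
        return full

    # Usage on each of world_size replicas (hypothetical):
    #   pool = ThreadPoolExecutor(max_workers=1)
    #   fut = save_shard_async(model.state_dict(), rank, world_size, pool, "/tmp/clone0")
    #   ... training continues while the shard is written ...
    #   fut.result()  # rendezvous before a clone calls reconstruct_state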
Year
2020
DOI
10.1109/CLUSTER49012.2020.00033
Venue
2020 IEEE International Conference on Cluster Computing (CLUSTER)
Keywords
deep learning, data-parallel training, layer-wise parallelism, model cloning, state replication
DocType
Conference
ISSN
1552-5244
ISBN
978-1-7281-6678-0
Citations
1
PageRank
0.36
References
22
Authors
4
Name                Order    Citations    PageRank
Bogdan Nicolae      1        392          29.51
Justin M. Wozniak   2        464          35.32
Matthieu Dorier     3        131          13.91
Franck Cappello     4        3775         251.47