Title
Parallel Training Via Computation Graph Transformation
Abstract
Parallel training can speed up the convergence of machine learning models by splitting the workload across multiple accelerators, drawing on a wide array of possible parallel paradigms (e.g., data parallelism, model parallelism, attribute parallelism, and pipeline parallelism). However, most machine learning frameworks lack sufficient support for these flexible and sometimes complex parallel training schemes (e.g., TensorFlow does not provide convenient APIs for any paradigm other than data parallelism), and the engineering effort to support all paradigms in all machine learning frameworks appears gigantic. In this paper, we demonstrate that most parallel training designs/paradigms can be abstracted as a computation graph transformation problem, so that they are realized via computation graph duplication, splitting, augmentation, and assignment to different accelerators, which are then connected by send/recv channels for tensor communication. Furthermore, conducting such computation graph transformations in a portable IR allows the engineering effort invested in parallel training to be applied widely across machine learning frameworks. We propose an extensible parallel training search space that describes parallel training schemes in a declarative fashion. We then implement a computation graph transformation compiler that instantiates these parallel schemes into explicit execution plans, which are readily executable on modern machine learning frameworks (such as TensorFlow). We maximize code reuse by handling parallel configurations and computation graph transformations in extended ONNX, which can be ported to machine learning frameworks by adapting their existing ONNX frontend/backend implementations. Our design reflects several good themes in machine learning frameworks, including code reuse via a powerful IR (as in MLIR) and the separation of declaration and realization (as in Halide/TVM).
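The abstract frames parallel training as graph duplication, splitting, augmentation, and device assignment, with send/recv channels connecting the resulting subgraphs. The following minimal Python sketch is illustrative only and is not the paper's extended-ONNX implementation: the Node/Graph types, the "grad_" naming convention, and the AllReduce aggregation node are assumptions made for the example. It shows how data parallelism can be expressed as such a transformation by replicating the graph once per device and then augmenting it with a communication node that aggregates the replicas' gradients.

# Hypothetical sketch: data parallelism as a computation graph transformation.
# Not the paper's IR; types and names here are invented for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str                      # operator name, e.g. "MatMul" or "AllReduce"
    inputs: List[str]            # input tensor names
    outputs: List[str]           # output tensor names
    device: str = "cpu"          # accelerator assignment

@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)

def replicate_for_data_parallelism(g: Graph, devices: List[str]) -> Graph:
    """Duplicate every node once per device, renaming tensors with a per-replica
    suffix, then augment the graph with one aggregation node per gradient tensor
    (playing the role of the send/recv communication channel)."""
    out = Graph()
    for d_idx, dev in enumerate(devices):
        for n in g.nodes:
            out.nodes.append(Node(
                op=n.op,
                inputs=[f"{t}@{d_idx}" for t in n.inputs],
                outputs=[f"{t}@{d_idx}" for t in n.outputs],
                device=dev,
            ))
    # Augmentation step: cross-replica aggregation of every gradient tensor.
    grad_tensors = {t for n in g.nodes for t in n.outputs if t.startswith("grad_")}
    for t in sorted(grad_tensors):
        out.nodes.append(Node(
            op="AllReduce",
            inputs=[f"{t}@{i}" for i in range(len(devices))],
            outputs=[f"{t}_avg"],
            device=devices[0],
        ))
    return out

# Example: a one-layer training step replicated across two GPUs.
base = Graph(nodes=[
    Node("MatMul", ["x", "w"], ["y"]),
    Node("MatMulGrad", ["y", "w"], ["grad_w"]),
])
parallel = replicate_for_data_parallelism(base, ["gpu:0", "gpu:1"])
for n in parallel.nodes:
    print(n.device, n.op, n.inputs, "->", n.outputs)

Under the same abstraction, model parallelism and pipelining would follow the analogous pattern, with graph splitting and stage assignment taking the place of duplication.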
Year
2019
DOI
10.1109/BigData47090.2019.9006180
Venue
2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)
Keywords
Parallel Training, Graph Transformation, Machine Learning Frameworks
Field
Pipeline (computing), Computer science, Parallel computing, Compiler, Data parallelism, Artificial intelligence, Graph rewriting, Porting, Code reuse, Machine learning, Executable, Speedup
DocType
Conference
ISSN
2639-1589
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Fei Wang        1      54         15.10
Guoyang Chen    2      13         2.83
Weifeng Zhang   3      23         9.87
Tiark Rompf     4      743        45.86