Title
Parallel Training Via Computation Graph Transformation
Abstract
Parallel training can speed up the convergence of machine learning models by splitting the workload across multiple accelerators, drawing on a wide array of possible parallel paradigms (e.g., data parallelism, model parallelism, attribute parallelism, and pipeline parallelism). However, most machine learning frameworks lack sufficient support for these flexible and sometimes complex parallel training schemes (e.g., TensorFlow does not provide convenient APIs for any paradigm other than data parallelism), and the engineering effort to support all paradigms in all machine learning frameworks appears gigantic. In this paper, we demonstrate that most parallel training designs/paradigms can be abstracted as a computation graph transformation problem, so that they are realized via computation graph duplication, splitting, augmentation, and assignment to different accelerators, which are then connected by send/recv channels for tensor communication. Furthermore, conducting such computation graph transformations in a portable IR allows the engineering effort invested in parallel training to be applied widely across machine learning frameworks. We propose an extensible parallel training search space that describes parallel training schemes in a declarative fashion. We then implement a computation graph transformation compiler that instantiates these parallel schemes into explicit execution plans, which are readily executable on modern machine learning frameworks (such as TensorFlow). We maximize code reuse by handling parallel configurations and computation graph transformations in extended ONNX, which can be ported to machine learning frameworks by adapting their existing ONNX frontend/backend implementations. Our design reflects several good themes in machine learning frameworks, including code reuse via a powerful IR (as in MLIR) and the separation of declaration and realization (as in Halide/TVM).
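The abstract frames parallel training as graph duplication, splitting, augmentation, and device assignment, with send/recv channels connecting the resulting subgraphs. The following minimal Python sketch is illustrative only and is not the paper's extended-ONNX implementation: the Node/Graph types, the "grad_" naming convention, and the AllReduce aggregation node are assumptions made for the example. It shows how data parallelism can be expressed as such a transformation by replicating the graph once per device and then augmenting it with a communication node that aggregates the replicas' gradients.

# Hypothetical sketch: data parallelism as a computation graph transformation.
# Not the paper's IR; types and names here are invented for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    op: str                      # operator name, e.g. "MatMul" or "AllReduce"
    inputs: List[str]            # input tensor names
    outputs: List[str]           # output tensor names
    device: str = "cpu"          # accelerator assignment

@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)

def replicate_for_data_parallelism(g: Graph, devices: List[str]) -> Graph:
    """Duplicate every node once per device, renaming tensors with a per-replica
    suffix, then augment the graph with one aggregation node per gradient tensor
    (playing the role of the send/recv communication channel)."""
    out = Graph()
    for d_idx, dev in enumerate(devices):
        for n in g.nodes:
            out.nodes.append(Node(
                op=n.op,
                inputs=[f"{t}@{d_idx}" for t in n.inputs],
                outputs=[f"{t}@{d_idx}" for t in n.outputs],
                device=dev,
            ))
    # Augmentation step: cross-replica aggregation of every gradient tensor.
    grad_tensors = {t for n in g.nodes for t in n.outputs if t.startswith("grad_")}
    for t in sorted(grad_tensors):
        out.nodes.append(Node(
            op="AllReduce",
            inputs=[f"{t}@{i}" for i in range(len(devices))],
            outputs=[f"{t}_avg"],
            device=devices[0],
        ))
    return out

# Example: a one-layer training step replicated across two GPUs.
base = Graph(nodes=[
    Node("MatMul", ["x", "w"], ["y"]),
    Node("MatMulGrad", ["y", "w"], ["grad_w"]),
])
parallel = replicate_for_data_parallelism(base, ["gpu:0", "gpu:1"])
for n in parallel.nodes:
    print(n.device, n.op, n.inputs, "->", n.outputs)

Under the same abstraction, model parallelism and pipelining would follow the analogous pattern, with graph splitting and stage assignment taking the place of duplication.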
Year
2019
DOI
10.1109/BigData47090.2019.9006180
Venue
2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)
Keywords
Parallel Training, Graph Transformation, Machine Learning Frameworks
Field
Pipeline (computing), Computer science, Parallel computing, Compiler, Data parallelism, Artificial intelligence, Graph rewriting, Porting, Code reuse, Machine learning, Executable, Speedup
DocType
Conference
ISSN
2639-1589
Citations
0
PageRank
0.34
References
0
Authors
4
Name            Order  Citations  PageRank
Fei Wang        1      54         15.10
Guoyang Chen    2      13         2.83
Weifeng Zhang   3      23         9.87
Tiark Rompf     4      743        45.86