Title
Comparative Study of Distributed Deep Learning Tools on Supercomputers.
Abstract
With the growing scale of datasets and neural networks, training time is increasing rapidly. Distributed parallel training has been proposed to accelerate deep neural network training, and most efforts have been made on top of GPU clusters. This paper focuses on the performance of distributed parallel training on the CPU clusters of supercomputer systems. Using resources of the Tianhe-2 supercomputer system, we conduct an extensive evaluation of popular deep learning tools, including Caffe, TensorFlow, and BigDL, and test several deep neural network models, including AutoEncoder, LeNet, AlexNet, and ResNet. The experimental results show that Caffe performs best in communication efficiency and scalability. BigDL is the fastest in computing speed, benefiting from its CPU optimizations, but it suffers from long communication delays due to its dependency on the MapReduce framework. The insights and conclusions from our evaluation provide a significant reference for improving the utilization of supercomputer resources in distributed deep learning.
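The abstract compares the tools on scalability and speedup across CPU nodes. As a minimal sketch (not taken from the paper), the Python snippet below shows how speedup and parallel efficiency are typically derived from measured per-epoch training times in such evaluations; the node counts and timings used here are hypothetical.

    # Minimal sketch, assuming the usual strong-scaling metrics; the timings
    # below are hypothetical and not taken from the paper.
    def speedup(t_one_node: float, t_n_nodes: float) -> float:
        """Speedup S(n) = T(1) / T(n)."""
        return t_one_node / t_n_nodes

    def efficiency(t_one_node: float, t_n_nodes: float, n: int) -> float:
        """Parallel (scaling) efficiency E(n) = S(n) / n."""
        return speedup(t_one_node, t_n_nodes) / n

    # Hypothetical per-epoch training times (seconds) on 1, 2, 4, 8 CPU nodes.
    timings = {1: 800.0, 2: 430.0, 4: 240.0, 8: 150.0}
    for n, t in timings.items():
        print(f"{n:>2} nodes: speedup = {speedup(timings[1], t):5.2f}, "
              f"efficiency = {efficiency(timings[1], t, n):6.1%}")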
Year
2018
Venue
ICA3PP
Field
Tianhe-2, Autoencoder, Supercomputer, Computer science, Parallel computing, Caffe, Artificial intelligence, Deep learning, Artificial neural network, Scalability, Speedup
DocType
Conference
Citations
1
PageRank
0.34
References
10
Authors
7
Name             Order  Citations  PageRank
Xin Du           1      127        26.78
Di Kuang         2      1          0.34
Yan Ye           3      54         12.55
Xinxin Li        4      27         8.16
Mengqiang Chen   5      1          0.68
Yunfei Du        6      72         14.62
Wei-Gang Wu      7      425        48.87