Title
ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library
Abstract
Distributed systems have been widely adopted for deep neural network model training. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, namely the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algor...
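The record does not include ACCL's interface. As a purely illustrative sketch of the kind of primitive such a collective communication library centers on, the following minimal C program performs a gradient-style allreduce using standard MPI (MPI_Allreduce); the buffer names and values are assumptions for illustration, not ACCL's API.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, world;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world);

        /* Toy per-rank "gradient"; real training would reduce large GPU buffers. */
        float grad[4], sum[4];
        for (int i = 0; i < 4; i++)
            grad[i] = (float)(rank + i);

        /* Allreduce: every rank ends up with the element-wise sum across all
           ranks -- the step that dominates communication in data-parallel
           training and that libraries like ACCL aim to make scale linearly. */
        MPI_Allreduce(grad, sum, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum[0] = %.1f across %d ranks\n", sum[0], world);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with mpirun, each rank contributes its local buffer and receives the global sum, which is the behavior a gradient-synchronization step relies on.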
Year
2021
DOI
10.1109/MM.2021.3091475
Venue
IEEE Micro
Keywords
Servers, Training, Bandwidth, Routing, Fabrics, Payloads, Parallel algorithms
DocType
Journal
Volume
41
Issue
5
ISSN
0272-1732
Citations
0
PageRank
0.34
References
0
Authors
20
Name             Order  Citations  PageRank
Jianbo Dong      1      4          1.46
Shaochuang Wang  2      0          0.34
Fei Feng         3      26         1.85
Zheng Cao        4      6          2.86
Heng Pan         5      0          0.34
Lingbo Tang      6      26         1.85
Pengcheng Li     7      0          0.34
Hao Li           8      25         11.35
Qianyuan Ran     9      0          0.34
Yiqun Guo        10     0          0.34
Shanyuan Gao     11     0          0.34
Xin Long         12     0          0.34
Jie Zhang        13     47         15.01
Yong Li          14     0          0.34
Zhisheng Xia     15     0          0.34
Liuyihan Song    16     4          2.15
Yingya Zhang     17     21         3.81
Pan Pan          18     3          4.16
Guohui Wang      19     1088       60.78
Xiaowei Jiang    20     5          1.76