Title
TensorExpress: In-Network Communication Scheduling for Distributed Deep Learning
Abstract
TensorExpress provides in-network communication scheduling for distributed deep learning (DDL). In cloud-based DDL, parameter communication over the network is a key bottleneck. Previous studies proposed tensor packet reordering approaches to reduce network blocking time; however, network contention persists in DDL. TensorExpress mitigates this contention and reduces overall training time by scheduling tensor packets in-network using P4, a switch programming language. TensorExpress improves latency and network blocking time by up to 2.5x and 2.44x, respectively.
Year
2020
DOI
10.1109/CLOUD49709.2020.00014
Venue
2020 IEEE 13th International Conference on Cloud Computing (CLOUD)
Keywords
distributed deep learning, parameter server architecture, P4, communication scheduling, in-network delay
DocType
Conference
ISSN
2159-6182
ISBN
978-1-7281-8781-5
Citations
1
PageRank
0.41
References
3
Authors
4
Name            Order  Citations  PageRank
Minkoo Kang     1      3          2.44
Gyeongsik Yang  2      5          6.68
Yeonho Yoo      3      1          1.42
Chuck Yoo       4      98         20.58