Title
Scaling Distributed Machine Learning with In-Network Aggregation.
Abstract
Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and by at least 20% for a number of real-world benchmark models.
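The abstract's core idea is an aggregation primitive that sums model updates from multiple workers inside the network, so each worker sends and receives one vector instead of exchanging updates with every peer. Below is a minimal, illustrative host-side sketch of that element-wise aggregation step; it is not the paper's switch dataplane (which processes packets inside a programmable switch), and the `aggregate` helper and toy worker vectors are hypothetical.

```python
# Illustrative sketch only: emulates the element-wise aggregation that the
# abstract attributes to the in-network primitive. In SwitchML itself this
# summation happens in a programmable switch dataplane, not on a host.

from typing import List


def aggregate(updates: List[List[float]]) -> List[float]:
    """Element-wise sum of the workers' update vectors (the aggregator's job)."""
    assert updates, "need at least one worker update"
    length = len(updates[0])
    assert all(len(u) == length for u in updates), "updates must be equal-sized"
    return [sum(column) for column in zip(*updates)]


if __name__ == "__main__":
    # Three hypothetical workers with toy gradient vectors.
    worker_updates = [
        [0.1, 0.2, -0.3],
        [0.0, 0.5, 0.1],
        [-0.2, 0.1, 0.2],
    ]
    aggregated = aggregate(worker_updates)
    # Each worker would receive this single aggregated vector back,
    # rather than one vector per peer.
    print("aggregated update:", aggregated)
```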
Year
2019
Venue
arXiv: Distributed, Parallel, and Cluster Computing
DocType
Journal
Volume
abs/1903.06701
Citations
3
PageRank
0.38
References
0
Authors
10
Name                  Order  Citations  PageRank
Amedeo Sapio          1      36         4.54
Marco Canini          2      857        60.21
Chen-Yu Ho            3      3          0.72
Jacob Nelson          4      281        17.27
Panos Kalnis          5      3297       141.30
Changhoon Kim         6      1716       121.18
Arvind Krishnamurthy  7      4540       312.24
Masoud Moshref        8      263        13.73
Dan R. K. Ports       9      445        22.52
Peter Richtárik       10     1314       84.53