Abstract
---
Training complex machine learning models in parallel is an increasingly important workload. We accelerate distributed parallel training by designing a communication primitive that uses a programmable switch dataplane to execute a key step of the training process. Our approach, SwitchML, reduces the volume of exchanged data by aggregating the model updates from multiple workers in the network. We co-design the switch processing with the end-host protocols and ML frameworks to provide a robust, efficient solution that speeds up training by up to 300%, and by at least 20% for a number of real-world benchmark models.
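As a rough illustration of the aggregation primitive the abstract describes, the Python sketch below simulates workers streaming fixed-size chunks of their model updates to an aggregator (standing in for the programmable switch), which returns only the element-wise sum. This is a minimal conceptual sketch, not the SwitchML protocol; the worker count, model size, and chunk size are illustrative assumptions.

```python
import numpy as np

NUM_WORKERS = 4
MODEL_SIZE = 1 << 10     # gradient elements per worker (assumed)
CHUNK_SIZE = 256         # elements aggregated per "packet" (assumed)

# Each worker produces a local gradient update.
rng = np.random.default_rng(0)
local_grads = [rng.standard_normal(MODEL_SIZE) for _ in range(NUM_WORKERS)]

def aggregate(chunks):
    """Switch-like step: element-wise sum of one chunk from every worker."""
    return np.sum(chunks, axis=0)

# Stream the model in chunks; every worker receives back only the summed
# chunk, i.e. the aggregate, rather than N separate per-worker updates.
aggregated = np.empty(MODEL_SIZE)
for start in range(0, MODEL_SIZE, CHUNK_SIZE):
    end = start + CHUNK_SIZE
    aggregated[start:end] = aggregate([g[start:end] for g in local_grads])

# Sanity check: the result equals the sum of all workers' gradients.
assert np.allclose(aggregated, np.sum(local_grads, axis=0))
```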
Field | Value
---|---
Year | 2019
Venue | arXiv: Distributed, Parallel, and Cluster Computing
DocType | Journal
Volume | abs/1903.06701
Citations | 3
PageRank | 0.38
References | 0
Authors | 10
Name | Order | Citations | PageRank |
---|---|---|---
Amedeo Sapio | 1 | 36 | 4.54 |
Marco Canini | 2 | 857 | 60.21 |
Chen-Yu Ho | 3 | 3 | 0.72 |
Jacob Nelson | 4 | 281 | 17.27 |
Panos Kalnis | 5 | 3297 | 141.30 |
Changhoon Kim | 6 | 1716 | 121.18 |
Arvind Krishnamurthy | 7 | 4540 | 312.24 |
Masoud Moshref | 8 | 263 | 13.73 |
Dan R. K. Ports | 9 | 445 | 22.52 |
Peter Richtárik | 10 | 1314 | 84.53 |