Title
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
Abstract
Current techniques and systems for distributed model training mostly assume that clusters are composed of homogeneous servers with constant resource availability. However, cluster heterogeneity is pervasive in computing infrastructure and is a fundamental characteristic of low-cost transient resources (such as EC2 spot instances). In this paper, we develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch size on each worker based on its resource availability and throughput. Our mini-batch controller seeks to equalize iteration times across all workers, and facilitates training on clusters comprised of servers with different amounts of CPU and GPU resources. This variable mini-batch technique uses proportional control and ideas from PID controllers to find stable mini-batch sizes. Our empirical evaluation shows that dynamic batching can reduce model training times by more than $4\times$ on heterogeneous clusters.
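Illustration: the abstract describes a proportional controller that resizes each worker's mini-batch to equalize iteration times while holding the global batch size fixed. The sketch below is not the paper's implementation; the worker names, gain `k_p`, minimum batch size, and rescaling step are all assumptions made for the example.

```python
# Hypothetical sketch of proportional control over per-worker mini-batch sizes.
# Assumptions (not from the paper): gain k_p, minimum batch size, and the
# rescaling step that keeps the summed (global) mini-batch size constant.

def update_batch_sizes(batch_sizes, iter_times, global_batch, k_p=0.5, min_bs=8):
    """Adjust each worker's mini-batch size proportionally to its deviation
    from the mean iteration time, then rescale so sizes sum to global_batch."""
    mean_time = sum(iter_times.values()) / len(iter_times)
    new_sizes = {}
    for worker, bs in batch_sizes.items():
        # Positive error: worker is slower than average, so shrink its batch.
        error = (iter_times[worker] - mean_time) / mean_time
        new_sizes[worker] = max(min_bs, bs * (1.0 - k_p * error))
    # Rescale so the global mini-batch size stays fixed across workers.
    scale = global_batch / sum(new_sizes.values())
    return {w: max(min_bs, int(round(bs * scale))) for w, bs in new_sizes.items()}

# Example: one GPU worker and two slower CPU workers sharing a global batch of 512.
sizes = {"gpu0": 171, "cpu0": 171, "cpu1": 170}
times = {"gpu0": 0.9, "cpu0": 1.6, "cpu1": 1.5}   # measured seconds per iteration
sizes = update_batch_sizes(sizes, times, global_batch=512)
print(sizes)  # faster worker gets a larger share; iteration times re-measured next step
```

In a running system this update would be applied repeatedly, with iteration times re-measured after each adjustment, until the controller settles on stable per-worker mini-batch sizes.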
Year
2020
DOI
10.1109/ACSOS49614.2020.00041
Venue
2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS)
Keywords
proportional control, PID controllers, mini-batch sizes, heterogeneous clusters, resource heterogeneity, distributed ML training, distributed model training, homogeneous servers, constant resource availability, cluster heterogeneity, low-cost transient resources, EC2 spot instances, dynamic batching, distributed data-parallel training, mini-batch controller, variable mini-batch technique, CPU resources, GPU resources, cloud server
DocType
Conference
ISBN
978-1-7281-7278-1
Citations
1
PageRank
0.36
References
0
Authors
2
Name            Order  Citations  PageRank
Sahil Tyagi     1      1          0.36
Prateek Sharma  2      11         3.23