Abstract | ||
---|---|---|
Current techniques and systems for distributed model training mostly assume that clusters consist of homogeneous servers with constant resource availability. However, cluster heterogeneity is pervasive in computing infrastructure and is a fundamental characteristic of low-cost transient resources (such as EC2 spot instances). In this paper, we develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch size on each worker based on its resource availability and throughput. Our mini-batch controller seeks to equalize iteration times across all workers and facilitates training on clusters composed of servers with different amounts of CPU and GPU resources. This variable mini-batch technique uses proportional control and ideas from PID controllers to find stable mini-batch sizes. Our empirical evaluation shows that dynamic batching can reduce model training times by more than $4\times$ on heterogeneous clusters. |
Year | DOI | Venue |
---|---|---|
2020 | 10.1109/ACSOS49614.2020.00041 | 2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS) |
Keywords | DocType | ISBN |
proportional control,PID controllers,mini-batch sizes,heterogeneous clusters,resource heterogeneity,distributed ML training,distributed model training,homogeneous servers,constant resource availability,cluster heterogeneity,low-cost transient resources,EC2 spot instances,dynamic batching,distributed data-parallel training,mini-batch controller,variable mini-batch technique,CPU resources,GPU resources,cloud server | Conference | 978-1-7281-7278-1 |
Citations | PageRank | References |
1 | 0.36 | 0 |
Authors | ||
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sahil Tyagi | 1 | 1 | 0.36 |
Prateek Sharma | 2 | 11 | 3.23 |
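
The proportional mini-batch adjustment described in the abstract can be illustrated with a minimal sketch. This is not the authors' controller: the function name `rebalance_batches`, the `gain` and `min_batch` parameters, the fixed global batch size, and the throughput estimate (samples per second, taken as batch size divided by measured iteration time) are all assumptions made for illustration.

```python
def rebalance_batches(batch_sizes, iter_times, gain=0.5, min_batch=1):
    """One proportional-control step over per-worker mini-batch sizes.

    Hypothetical sketch: each worker's batch is moved a fraction `gain`
    of the way toward a share of the global batch proportional to its
    measured throughput, so faster workers get more samples per iteration.
    """
    global_batch = sum(batch_sizes)  # assumption: global batch stays fixed

    # Estimated throughput of each worker in samples per second.
    throughputs = [b / t for b, t in zip(batch_sizes, iter_times)]
    total = sum(throughputs)

    # Batch sizes that would equalize iteration times if throughput
    # were independent of batch size.
    targets = [global_batch * x / total for x in throughputs]

    # Proportional correction: error = target - current batch size.
    # Rounding means the sum may drift slightly from global_batch.
    return [
        max(min_batch, round(b + gain * (tgt - b)))
        for b, tgt in zip(batch_sizes, targets)
    ]


# Two workers with equal batches; the second is twice as slow.
print(rebalance_batches([64, 64], [1.0, 2.0]))
```

A gain below 1 damps the correction so that noisy iteration-time measurements do not cause the batch sizes to oscillate; calling the function once per measurement window moves the sizes gradually toward a stable split.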