Title
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
Abstract
Current techniques and systems for distributed model training mostly assume that clusters are composed of homogeneous servers with constant resource availability. However, cluster heterogeneity is pervasive in computing infrastructure and is a fundamental characteristic of low-cost transient resources (such as EC2 spot instances). In this paper, we develop a dynamic batching technique for distributed data-parallel training that adjusts the mini-batch size on each worker based on its resource availability and throughput. Our mini-batch controller seeks to equalize iteration times across all workers, and facilitates training on clusters comprised of servers with different amounts of CPU and GPU resources. This variable mini-batch technique uses proportional control and ideas from PID controllers to find stable mini-batch sizes. Our empirical evaluation shows that dynamic batching can reduce model training times by more than $4\times$ on heterogeneous clusters.
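Illustration: the abstract describes a proportional controller that resizes each worker's mini-batch to equalize iteration times while holding the global batch size fixed. The sketch below is not the paper's implementation; the worker names, gain `k_p`, minimum batch size, and rescaling step are all assumptions made for the example.

```python
# Hypothetical sketch of proportional control over per-worker mini-batch sizes.
# Assumptions (not from the paper): gain k_p, minimum batch size, and the
# rescaling step that keeps the summed (global) mini-batch size constant.

def update_batch_sizes(batch_sizes, iter_times, global_batch, k_p=0.5, min_bs=8):
    """Adjust each worker's mini-batch size proportionally to its deviation
    from the mean iteration time, then rescale so sizes sum to global_batch."""
    mean_time = sum(iter_times.values()) / len(iter_times)
    new_sizes = {}
    for worker, bs in batch_sizes.items():
        # Positive error: worker is slower than average, so shrink its batch.
        error = (iter_times[worker] - mean_time) / mean_time
        new_sizes[worker] = max(min_bs, bs * (1.0 - k_p * error))
    # Rescale so the global mini-batch size stays fixed across workers.
    scale = global_batch / sum(new_sizes.values())
    return {w: max(min_bs, int(round(bs * scale))) for w, bs in new_sizes.items()}

# Example: one GPU worker and two slower CPU workers sharing a global batch of 512.
sizes = {"gpu0": 171, "cpu0": 171, "cpu1": 170}
times = {"gpu0": 0.9, "cpu0": 1.6, "cpu1": 1.5}   # measured seconds per iteration
sizes = update_batch_sizes(sizes, times, global_batch=512)
print(sizes)  # faster worker gets a larger share; iteration times re-measured next step
```

In a running system this update would be applied repeatedly, with iteration times re-measured after each adjustment, until the controller settles on stable per-worker mini-batch sizes.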
Year
2020
DOI
10.1109/ACSOS49614.2020.00041
Venue
2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems (ACSOS)
Keywords
proportional control, PID controllers, mini-batch sizes, heterogeneous clusters, resource heterogeneity, distributed ML training, distributed model training, homogeneous servers, constant resource availability, cluster heterogeneity, low-cost transient resources, EC2 spot instances, dynamic batching, distributed data-parallel training, mini-batch controller, variable mini-batch technique, CPU resources, GPU resources, cloud server
DocType
Conference
ISBN
978-1-7281-7278-1
Citations
1
PageRank
0.36
References
0
Authors
2
Name            Order  Citations  PageRank
Sahil Tyagi     1      1          0.36
Prateek Sharma  2      11         3.23