Title
Grouper: Accelerating Hyperparameter Searching in Deep Learning Clusters With Network Scheduling
Abstract
Training a high-accuracy deep learning model requires trying hundreds of hyperparameter configurations to find the optimal one. It is common to launch a group of training jobs (called a cojob) with different configurations at the same time and to stop the worst-performing jobs at the end of every stage (i.e., a fixed number of iterations). Hyperparameter searching therefore calls for minimizing stage completion time (SCT). To complete stages quickly, each job in a cojob typically uses multiple GPUs for distributed training, and the GPUs exchange data over the network every iteration to synchronize their models. However, because a GPU cluster hosts many cojobs from different users, the data transfers of DL jobs compete for network bandwidth, causing network congestion and consequently large SCTs for cojobs. Existing flow schedulers, which aim to reduce flow/coflow/job completion time, do not match the requirements of hyperparameter searching. In this paper, we design and implement Grouper, a system that minimizes the average SCT of cojobs. Grouper uses a carefully designed algorithm to permute the stages of cojobs and schedules flows from different stages in the order of that permutation. Extensive testbed experiments and simulations show that Grouper outperforms the advanced network designs Baraat, Sincronia, and per-flow fair sharing.
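The intuition behind ordering stages to shrink average SCT can be illustrated with a toy model; this is an assumption for illustration, not the paper's actual algorithm. On a single bottleneck link with unit bandwidth, running stages with less remaining network demand first (a shortest-stage-first permutation, in the spirit of SRPT) is never worse than arrival order for average completion time:

```python
def average_sct(stage_sizes, order):
    """Average completion time when stages run back-to-back in `order`
    on one bottleneck link with unit bandwidth (toy model)."""
    t, total = 0.0, 0.0
    for i in order:
        t += stage_sizes[i]   # stage i finishes at time t
        total += t
    return total / len(stage_sizes)

# Hypothetical stages: total bytes each stage still needs to transfer.
sizes = [9.0, 2.0, 5.0]

fifo = [0, 1, 2]                                          # arrival order
srpt = sorted(range(len(sizes)), key=lambda i: sizes[i])  # shortest first

print(average_sct(sizes, fifo))  # → 12.0
print(average_sct(sizes, srpt))  # smaller: shortest-first is never worse here
```

In this toy setting, shortest-stage-first drops the average SCT from 12.0 to 25/3 ≈ 8.33; Grouper's contribution is a permutation algorithm for the much harder multi-cojob, multi-link case.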
Year
2020
DOI
10.1109/TNSM.2020.2989187
Venue
IEEE Transactions on Network and Service Management
Keywords
Deep learning, hyperparameter search, flow scheduling, stage completion time
DocType
Journal
Volume
17
Issue
3
ISSN
1932-4537
Citations
0
PageRank
0.34
References
0
Authors
3
Name	Order	Citations	PageRank
Pan Zhou	1	123	16.76
Hongfang Yu	2	38	8.17
Gang Sun	3	463	36.98