Title
Grouper: Accelerating Hyperparameter Searching in Deep Learning Clusters With Network Scheduling
Abstract
Training a high-accuracy deep learning model requires trying hundreds of hyperparameter configurations to find the optimal one. It is common to launch a group of training jobs (called a cojob) with different configurations at the same time and to stop the worst-performing jobs at the end of every stage (i.e., a fixed number of iterations). Hyperparameter searching therefore calls for minimizing stage completion time (SCT). To complete stages quickly, each job in a cojob typically uses multiple GPUs for distributed training, and the GPUs exchange data over the network every iteration to synchronize their models. However, because a GPU cluster hosts many cojobs from different users, the data transfers of DL jobs compete for network bandwidth, causing network congestion and consequently large SCTs for cojobs. Existing flow schedulers, which aim to reduce flow/coflow/job completion time, do not match the requirements of hyperparameter searching. In this paper, we design and implement Grouper, a system that minimizes the average SCT of cojobs. Grouper uses a carefully designed algorithm to permute the stages of cojobs and schedules flows from different stages in the order of that permutation. Extensive testbed experiments and simulations show that Grouper outperforms the advanced network designs Baraat, Sincronia, and per-flow fair sharing.
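The intuition behind ordering stages to shrink average SCT can be illustrated with a toy model; this is an assumption for illustration, not the paper's actual algorithm. On a single bottleneck link with unit bandwidth, running stages with less remaining network demand first (a shortest-stage-first permutation, in the spirit of SRPT) is never worse than arrival order for average completion time:

```python
def average_sct(stage_sizes, order):
    """Average completion time when stages run back-to-back in `order`
    on one bottleneck link with unit bandwidth (toy model)."""
    t, total = 0.0, 0.0
    for i in order:
        t += stage_sizes[i]   # stage i finishes at time t
        total += t
    return total / len(stage_sizes)

# Hypothetical stages: total bytes each stage still needs to transfer.
sizes = [9.0, 2.0, 5.0]

fifo = [0, 1, 2]                                          # arrival order
srpt = sorted(range(len(sizes)), key=lambda i: sizes[i])  # shortest first

print(average_sct(sizes, fifo))  # → 12.0
print(average_sct(sizes, srpt))  # smaller: shortest-first is never worse here
```

In this toy setting, shortest-stage-first drops the average SCT from 12.0 to 25/3 ≈ 8.33; Grouper's contribution is a permutation algorithm for the much harder multi-cojob, multi-link case.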
Year
2020
DOI
10.1109/TNSM.2020.2989187
Venue
IEEE Transactions on Network and Service Management
Keywords
Deep learning, hyperparameter search, flow scheduling, stage completion time
DocType
Journal
Volume
17
Issue
3
ISSN
1932-4537
Citations
0
PageRank
0.34
References
0
Authors
3
Name	Order	Citations	PageRank
Pan Zhou	1	123	16.76
Hongfang Yu	2	38	8.17
Gang Sun	3	463	36.98