Title
JPAS: Job-progress-aware flow scheduling for deep learning clusters
Abstract
Deep learning (DL) is an increasingly important tool for large-scale data analytics, and DL workloads are common in today's production clusters due to the growing number of deep-learning-driven services (e.g., online search and speech recognition). To handle ever-growing training datasets, it is common to conduct distributed DL (DDL) training across multiple machines in parallel. Training DL models in parallel can incur significant bandwidth contention on shared clusters. As a result, the network is a well-known bottleneck for distributed training, and efficient network scheduling is essential for maximizing the performance of DL training. DL training is a feedback-driven exploration (e.g., hyper-parameter tuning, model structure optimization) that requires multiple retrainings of deep learning models that differ in their configuration. The information obtained at the early stage of each retraining can facilitate the search for high-quality models. Thus, reducing the early-stage time can accelerate the exploration. In this paper, we propose JPAS, a flow scheduling system for DDL training jobs that aims to reduce the early-stage time. JPAS uses a simple greedy mechanism to periodically order all DDL jobs. Each host machine sets priorities for its flows according to the corresponding job order and offloads flow scheduling and rate allocation to the underlying priority-enabled network. We evaluate JPAS on a real testbed composed of 13 servers and a commodity switch. The evaluation results demonstrate that JPAS can reduce the time to reach 90% or 95% of the converged accuracy by up to 38%. Hence, JPAS can substantially reduce the early-stage time and thus accelerate the search for high-quality models.
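The ordering-then-priority step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the greedy criterion or how job progress is measured, so everything here is an assumption for illustration: the `Job` class, the `progress` field, the least-progress-first rule, and the choice of 8 priority classes are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass

# Assumption: the network offers 8 priority classes; the paper's value may differ.
NUM_PRIORITY_QUEUES = 8


@dataclass
class Job:
    job_id: str
    progress: float  # hypothetical metric in [0, 1]; 0 means training has just started


def order_jobs(jobs):
    """Greedy ordering under an assumed rule: least-progressed jobs come first."""
    return sorted(jobs, key=lambda j: (j.progress, j.job_id))


def assign_priorities(jobs, num_queues=NUM_PRIORITY_QUEUES):
    """Map each job's rank to a priority class, where 0 is the highest priority.

    Jobs ranked beyond the number of available classes share the lowest class.
    """
    ordered = order_jobs(jobs)
    return {job.job_id: min(rank, num_queues - 1) for rank, job in enumerate(ordered)}


jobs = [Job("A", 0.80), Job("B", 0.10), Job("C", 0.45)]
priorities = assign_priorities(jobs)
print(priorities)
```

In this sketch, each host would tag the flows of a job with that job's priority class and leave the actual scheduling and rate allocation to the priority-enabled network, as the abstract describes. Rerunning `assign_priorities` on a timer would correspond to the periodic reordering.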
Year
2020
DOI
10.1016/j.jnca.2020.102590
Venue
Journal of Network and Computer Applications
Keywords
Machine learning, Deep learning, Flow scheduling, Job progress aware
DocType
Journal
Volume
158
ISSN
1084-8045
Citations
0
PageRank
0.34
References
0
Authors
5
Name
Order
Citations
PageRank
Pan Zhou112316.76
Xinshu He200.34
Shouxi Luo322.75
Hongfang Yu4388.17
Gang Sun546336.98