Title
CODA: Improving Resource Utilization by Slimming and Co-locating DNN and CPU Jobs
Abstract
While deep neural network (DNN) models are often trained on GPUs, many companies and research institutes build GPU clusters that are shared by different groups. On such GPU clusters, DNN training jobs also require CPU cores to run pre-processing and gradient synchronization. Our investigation shows that the number of cores allocated to a training job significantly impacts its performance. To this end, we characterize representative deep learning models in terms of their CPU core requirements under different GPU resource configurations, and study the sensitivity of these models to other CPU-side shared resources. Based on this characterization, we propose CODA, a scheduling system comprised of an adaptive CPU allocator, a real-time contention eliminator, and a multi-array job scheduler. Experimental results show that CODA improves GPU utilization by 20.8% on average without increasing the queuing time of CPU jobs.
Year
2020
DOI
10.1109/ICDCS47774.2020.00069
Venue
2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)
Keywords
DNN training, CPU demand, resource utilization
DocType
Conference
ISSN
1063-6927
ISBN
978-1-7281-7003-9
Citations
1
PageRank
0.35
References
0
Authors
8
Name          Order  Citations  PageRank
Han Zhao      1      8          1.81
Weihao Cui    2      13         3.27
Quan Chen     3      175        21.86
Jingwen Leng  4      49         12.97
Kai Yu        5      25         4.47
Deze Zeng     6      49         8.68
Chao Li       7      344        37.85
Minyi Guo     8      3969       332.25