Title
Olympian: Scheduling GPU Usage in a Deep Neural Network Model Serving System.
Abstract
Deep neural networks (DNNs) are emerging as important drivers of GPU (Graphics Processing Unit) usage. Cloud offerings now routinely include GPU-capable VMs, and GPUs are used for both training and testing DNNs. A popular way to run inference (or testing) tasks with DNNs is to use middleware called a serving system; TensorFlow-Serving (TF-Serving) is an example of such a DNN serving system. In this paper, we consider the problem of carefully scheduling multiple concurrent DNNs in a serving system on a single GPU to achieve fairness or service differentiation objectives, a capability crucial to cloud-based TF-Serving offerings. Scheduling DNNs poses two challenges: how to schedule, and switch between, different DNN jobs at low overhead, and how to account for their GPU usage. Our system, Olympian, extends TF-Serving to enable fair sharing of a GPU across multiple concurrent large DNNs at low overhead, a capability TF-Serving by itself cannot achieve. Specifically, Olympian can run concurrent instances of several large DNN models such as Inception, ResNet, GoogLeNet, AlexNet, and VGG, providing each with an equal share of the GPU while interleaving them at timescales of 1-2 ms and incurring an overhead of less than 2%. It achieves this by leveraging the predictability of GPU computations: it profiles GPU resource usage models offline, then uses these profiles to switch between DNNs at low overhead.
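The scheduling idea the abstract describes can be illustrated with a minimal sketch: use offline-profiled per-operation GPU costs to interleave models in millisecond-scale quanta, always resuming the model with the least accumulated GPU time. This is an illustrative toy, not the authors' implementation; the function name, the profile format, and all numbers are assumptions made for the example.

```python
# Hypothetical sketch of profile-driven fair-share scheduling in the spirit
# of the abstract; names and values are illustrative, not from Olympian.
from collections import deque

def fair_share_schedule(profiles, quantum_ms=2.0, total_ms=20.0):
    """Interleave models in ~quantum_ms slices using profiled per-op costs.

    profiles: dict mapping model name -> list of per-operation GPU times
    in ms, obtained offline (the abstract's "GPU resource usage models").
    Returns (trace, usage): the slice-by-slice execution trace and the
    accumulated GPU time per model.
    """
    queues = {m: deque(ops) for m, ops in profiles.items()}
    usage = {m: 0.0 for m in profiles}
    trace = []
    clock = 0.0
    while clock < total_ms and any(queues.values()):
        # Fairness: resume the model with the least accumulated GPU time.
        ready = [m for m, q in queues.items() if q]
        m = min(ready, key=lambda x: usage[x])
        # Run whole operations until the quantum is filled; because the
        # per-op costs are profiled, the switch point is predictable and
        # the scheduler never preempts mid-operation.
        slice_ms = 0.0
        while queues[m] and slice_ms + queues[m][0] <= quantum_ms:
            slice_ms += queues[m].popleft()
        if slice_ms == 0.0:          # first op alone exceeds the quantum:
            slice_ms = queues[m].popleft()  # run it anyway to make progress
        usage[m] += slice_ms
        clock += slice_ms
        trace.append((m, slice_ms))
    return trace, usage
```

With two models whose profiled ops cost 0.5 ms each, the sketch alternates 2 ms slices between them, so both accumulate equal GPU time, mirroring the equal-share behavior the abstract claims at 1-2 ms interleaving timescales.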
Year
2018
DOI
10.1145/3274808.3274813
Venue
Middleware '18: 19th International Middleware Conference, Rennes, France, December 2018
Field
Middleware, Predictability, Inference, Scheduling (computing), Computer science, Artificial neural network, Interleaving, Computation, Cloud computing, Distributed computing
DocType
Conference
ISBN
978-1-4503-5702-9
Citations
0
PageRank
0.34
References
17
Authors
4
Name              Order  Citations  PageRank
Yitao Hu          1      0          0.68
Swati Rallapalli  2      202        13.89
Bongjun Ko        3      268        21.18
Ramesh Govindan   4      15430      2144.86