Title
Interference-driven resource management for GPU-based heterogeneous clusters
Abstract
GPU-based clusters are increasingly being deployed in HPC environments to accelerate a variety of scientific applications. Despite their growing popularity, the GPU devices themselves are under-utilized even for many computationally-intensive jobs. This stems from the fact that the typical GPU usage model is one in which a host processor periodically offloads computationally intensive portions of an application to the coprocessor. Since some portions of code cannot be offloaded to the GPU (for example, code performing network communication in MPI applications), this usage model results in periods of time when the GPU is idle. GPUs could be time-shared across jobs to "fill" these idle periods, but unlike CPU resources such as the cache, the effects of sharing the GPU are not well understood. Specifically, two jobs that time-share a single GPU will experience resource contention and interfere with each other. The resulting slow-down could lead to missed job deadlines. Current cluster managers do not support GPU-sharing, but instead dedicate GPUs to a job for the job's lifetime. In this paper, we present a framework to predict and handle interference when two or more jobs time-share GPUs in HPC clusters. Our framework consists of an analysis model, and a dynamic interference detection and response mechanism to detect excessive interference and restart the interfering jobs on different nodes. We implement our framework in Torque, an open-source cluster manager, and using real workloads on an HPC cluster, show that interference-aware two-job colocation (although our method is applicable to colocating more than two jobs) improves GPU utilization by 25%, reduces a job's waiting time in the queue by 39% and improves job latencies by around 20%.
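The detection-and-response step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's analysis model. It assumes that each job has a known dedicated-GPU runtime and a projected runtime under time-sharing, and it flags a job for restart on another node when the ratio exceeds a threshold. The names `JobSample`, `slowdown`, `jobs_to_relocate`, and the default threshold of 1.5 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    job_id: str
    solo_runtime_s: float    # assumed: runtime with a dedicated GPU
    shared_runtime_s: float  # assumed: projected runtime while time-sharing


def slowdown(sample: JobSample) -> float:
    """Ratio of shared to dedicated runtime; 1.0 means no interference."""
    return sample.shared_runtime_s / sample.solo_runtime_s


def jobs_to_relocate(samples, max_slowdown=1.5):
    """Return IDs of jobs whose slowdown exceeds the (hypothetical) threshold."""
    return [s.job_id for s in samples if slowdown(s) > max_slowdown]


samples = [
    JobSample("job-a", solo_runtime_s=100.0, shared_runtime_s=120.0),
    JobSample("job-b", solo_runtime_s=100.0, shared_runtime_s=180.0),
]
flagged = jobs_to_relocate(samples)
```

In this toy case only `job-b` (slowdown 1.8) exceeds the threshold; a real cluster manager would also have to weigh the cost of restarting a job against the time lost to interference.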
Year
2012
DOI
10.1145/2287076.2287091
Venue
HPDC
Keywords
typical gpu usage model,single gpu,gpu-based heterogeneous cluster,jobs time-share gpus,gpu-based cluster,job latency,gpu device,computationally-intensive job,hpc cluster,interference-driven resource management,gpu utilization,job deadline,cluster,time sharing,co processor,resource manager,scheduling,interference
Field
Resource management,GPU cluster,Cache,Computer science,CUDA,Scheduling (computing),Queue,Parallel computing,Real-time computing,Interference (wave propagation),Coprocessor,Distributed computing
DocType
Conference
Citations
18
PageRank
0.66
References
20
Authors
5
Name | Order | Citations | PageRank
Rajat Phull | 1 | 30 | 1.28
Cheng-Hong Li | 2 | 79 | 5.98
Kunal Rao | 3 | 33 | 3.03
Hari Cadambi | 4 | 18 | 0.66
Srimat T. Chakradhar | 5 | 2492 | 185.94