Interference-driven resource management for GPU-based heterogeneous clusters - Citegraph

Paper Info

Title
Interference-driven resource management for GPU-based heterogeneous clusters

Abstract
GPU-based clusters are increasingly being deployed in HPC environments to accelerate a variety of scientific applications. Despite their growing popularity, the GPU devices themselves are under-utilized even for many computationally-intensive jobs. This stems from the fact that the typical GPU usage model is one in which a host processor periodically offloads computationally intensive portions of an application to the coprocessor. Since some portions of code cannot be offloaded to the GPU (for example, code performing network communication in MPI applications), this usage model results in periods of time when the GPU is idle. GPUs could be time-shared across jobs to "fill" these idle periods, but unlike CPU resources such as the cache, the effects of sharing the GPU are not well understood. Specifically, two jobs that time-share a single GPU will experience resource contention and interfere with each other. The resulting slow-down could lead to missed job deadlines. Current cluster managers do not support GPU-sharing, but instead dedicate GPUs to a job for the job's lifetime. In this paper, we present a framework to predict and handle interference when two or more jobs time-share GPUs in HPC clusters. Our framework consists of an analysis model, and a dynamic interference detection and response mechanism to detect excessive interference and restart the interfering jobs on different nodes. We implement our framework in Torque, an open-source cluster manager, and using real workloads on an HPC cluster, show that interference-aware two-job colocation (although our method is applicable to colocating more than two jobs) improves GPU utilization by 25%, reduces a job's waiting time in the queue by 39% and improves job latencies by around 20%.

Year	DOI	Venue
2012	10.1145/2287076.2287091	HPDC
Keywords	Field	DocType
typical gpu usage model,single gpu,gpu-based heterogeneous cluster,jobs time-share gpus,gpu-based cluster,job latency,gpu device,computationally-intensive job,hpc cluster,interference-driven resource management,gpu utilization,job deadline,cluster,time sharing,co processor,resource manager,scheduling,interference	Resource management,GPU cluster,Cache,Computer science,CUDA,Scheduling (computing),Queue,Parallel computing,Real-time computing,Interference (wave propagation),Coprocessor,Distributed computing	Conference
Citations	PageRank	References
18	0.66	20
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
Rajat Phull	1	30	1.28
Cheng-Hong Li	2	79	5.98
Kunal Rao	3	33	3.03
Hari Cadambi	4	18	0.66
Srimat T. Chakradhar	5	2492	185.94