Scalable Resource Management in High Performance Computers - Citegraph

Paper Info

Title
Scalable Resource Management in High Performance Computers

Abstract
Clusters of workstations have emerged as an important platform for building cost-effective, scalable, and highly-available computers. Although many hardware solutions are available today, the largest challenge in making large-scale clusters usable lies in the system software. In this paper we present STORM, a resource management tool designed to provide scalability, low overhead, and the flexibility necessary to efficiently support and analyze a wide range of job-scheduling algorithms. STORM achieves these feats by using a small set of primitive mechanisms that are common in modern high-performance interconnects. The architecture of STORM is based on three main technical innovations. First, a part of the scheduler runs in the thread processor located on the network interface. Second, we use hardware collectives that are highly scalable both for implementing control heartbeats and to distribute the binary of a parallel job in near-constant time. Third, we use an I/O bypass protocol that allows fast data movements from the file system to the communication buffers in the network interface and vice versa. The experimental results show that STORM can launch a job with a binary of 12 MB on a 64-processor, 32-node cluster in less than 250 ms. This paper provides experimental and analytical evidence that these results scale to a much larger number of nodes. To the best of our knowledge, STORM significantly outperforms existing production schedulers in launching jobs, performing resource management tasks, and gang-scheduling tasks.

Year	DOI	Venue
2002	10.1109/CLUSTR.2002.1137759	CLUSTER
Keywords	Field	DocType
file system,system software,user-level communication,quadrics interconnect,cluster computing,results scale,gang scheduling,network interface,resource management task,scalable resource management,hardware solution,resource management tool,job scheduling,parallel architectures,large-scale clusters usable,i/o bypass,performance evaluation,hardware collective,resource management,high performance computers,management,network interfaces,architecture,production,algorithm design and analysis,resource manager,high performance computing,hardware,resource allocation,algorithms,evaluation,workstations,cost effectiveness,job performance,storms,production scheduling,scalability,performance,communications,storm	Resource management,System software,File system,Computer science,Parallel computing,Real-time computing,Resource allocation,Job scheduler,Computer cluster,Distributed computing,Network interface,Scalability	Conference
ISBN	Citations	PageRank
0-7695-1745-5	1	0.36
References	Authors
22	4

Authors (4 rows)

Name	Order	Citations	PageRank
Eitan Frachtenberg	1	1060	85.08
Fabrizio Petrini	2	2050	165.82
Juan Fernandez	3	269	23.17
Salvador Coll	4	609	57.12
