Title
STORM: Scalable Resource Management for Large-Scale Parallel Computers
Abstract
Although clusters are a popular form of high-performance computing, they remain more difficult to manage than sequential systems—or even symmetric multiprocessors. In this paper, we identify a small set of primitive mechanisms that are sufficiently general to be used as building blocks to solve a variety of resource-management problems. We then present STORM, a resource-management environment that embodies these mechanisms in a scalable, low-overhead, and efficient implementation. The key innovation behind STORM is a modular software architecture that reduces all resource management functionality to a small number of highly scalable mechanisms. These mechanisms simplify the integration of resource management with low-level network features. As a result of this design, STORM can launch large, parallel applications an order of magnitude faster than the best time reported in the literature and can gang-schedule a parallel application as fast as the node OS can schedule a sequential application. This paper describes the mechanisms and algorithms behind STORM and presents a detailed performance model that shows that STORM's performance can scale to thousands of nodes.
Year: 2006
DOI: 10.1109/TC.2006.206
Venue: IEEE Transactions on Computers
Keywords: computer network management, network operating systems, parallel machines, processor scheduling, resource allocation, software architecture, workstation clusters, cluster computing, high-performance computing, large-scale parallel computers, modular software architecture, node OS, parallel application gang-scheduling, performance model, scalable resource management environment, sequential application scheduling, sequential system management, symmetric multiprocessor management, hardware/software interface, integration, modeling, supercomputers, system architectures
Field: Resource management, Supercomputer, Computer science, Parallel computing, Network operating system, Real-time computing, Resource allocation, Software architecture, Modular design, Computer cluster, Scalability, Distributed computing
DocType: Journal
Volume: 55
Issue: 12
ISSN: 0018-9340
Citations: 5
PageRank: 0.56
References: 26
Authors: 4
Name                Order  Citations  PageRank
Eitan Frachtenberg  1      1060       85.08
Fabrizio Petrini    2      2050       165.82
Juan Fernandez      3      269        23.17
Scott Pakin         4      1098       134.55