Title
Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems
Abstract
The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also decreases the Mean Time To Interrupt, because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the relationship between Mean Time To Interrupt and the number of processor elements involved is quite similar across a large number of platforms. In this paper we present a system that uses hardware-level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close-to-optimal set of processing elements given the available pool and the reliability requirements of the application.
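The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the monitored metric, the z-score criterion, the threshold, and the names `rank_nodes_by_zscore` and `select_nodes` are all assumptions made for illustration. The sketch scores each node by how far its reading lies from the mean of its peers and offers the scheduler only nodes close to the center of the distribution.

```python
from statistics import mean, stdev


def rank_nodes_by_zscore(readings):
    """Rank nodes by |z-score| of a monitored metric, most typical first.

    `readings` maps a hypothetical node name to one hardware reading
    (for example a temperature). This is an illustrative criterion only.
    """
    values = list(readings.values())
    if len(values) < 2:
        # A sample standard deviation needs at least two readings.
        return [(node, 0.0) for node in sorted(readings)]
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        # All readings identical: no node stands out.
        return [(node, 0.0) for node in sorted(readings)]
    scores = {node: (value - mu) / sigma for node, value in readings.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]))


def select_nodes(readings, count, max_abs_z):
    """Return up to `count` nodes whose |z-score| does not exceed `max_abs_z`."""
    ranked = rank_nodes_by_zscore(readings)
    eligible = [node for node, z in ranked if abs(z) <= max_abs_z]
    return eligible[:count]


if __name__ == "__main__":
    # Made-up readings; node n5 is an obvious outlier.
    sample = {"n1": 50.0, "n2": 51.0, "n3": 49.0, "n4": 50.5, "n5": 80.0}
    print(select_nodes(sample, count=3, max_abs_z=1.5))
```

A real deployment would likely combine several metrics and use a statistic that is less sensitive to outliers than the mean; the single-metric z-score is used here only to keep the example short.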
Year: 2008
DOI: 10.1109/CCGRID.2008.124
Venue: CCGrid
Keywords: robustness, high performance computing, scheduling, resilience, statistical distribution, resource manager, fault tolerant, statistical distributions, statistical analysis, hardware, resource management
Field: Resource management, Interrupt, Supercomputer, Computer science, Scheduling (computing), Robustness (computer science), Real-time computing, Probability distribution, Probabilistic logic, Interconnection, Distributed computing
DocType: Conference
ISSN: 2376-4414
ISBN: 978-0-7695-3156-4
Citations: 7
PageRank: 0.94
References: 2
Authors: 7
Name, Order, Citations, PageRank
Jim M. Brandt, 1, 22, 3.04
Bert J. Debusschere, 2, 111, 4.65
Ann C. Gentile, 3, 37, 7.91
Jackson Mayo, 4, 43, 7.97
Philippe P. Pébay, 5, 273, 27.36
David C. Thompson, 6, 7, 0.94
Matthew Wong, 7, 7, 0.94