Title
Quantifying effectiveness of failure prediction and response in HPC systems: Methodology and example
Abstract
Effective failure prediction and mitigation strategies in high-performance computing systems could provide huge gains in resilience of tightly coupled large-scale scientific codes. These gains would come from prediction-directed process migration and resource servicing, intelligent resource allocation, and checkpointing driven by failure predictors rather than at regular intervals based on nominal mean time to failure. Given probabilistic associations of outlier behavior in hardware-related metrics with eventual failure in hardware, system software, and/or applications, this paper explores approaches for quantifying the effects of prediction and mitigation strategies and demonstrates these using actual production system data. We describe context-relevant methodologies for determining the accuracy and cost-benefit of predictors.
Year
2010
DOI
10.1109/DSNW.2010.5542629
Venue
Dependable Systems and Networks Workshops
Keywords
mitigation strategy, system software, eventual failure, hpc system, context-relevant methodology, resource servicing, high-performance computing system, effective failure prediction, intelligent resource allocation, actual production system data, hardware-related metrics, quantifying effectiveness, distributed processing, process migration, production system, accuracy, computational modeling, operating systems, memory management, measurement, resource allocation, mean time to failure, statistical analysis
Field
Mean time between failures, Psychological resilience, System software, Computer science, Process migration, Outlier, Real-time computing, Memory management, Resource allocation, Probabilistic logic, Reliability engineering, Distributed computing
DocType
Conference
ISBN
978-1-4244-7728-9
Citations
1
PageRank
0.35
References
6
Authors
9
Name                 Order  Citations  PageRank
James Brandt         1      2          0.71
Frank Chen           2      1          0.35
Vincent De Sapio     3      55         6.78
Ann C. Gentile       4      37         7.91
Jackson Mayo         5      43         7.97
Philippe P. Pébay    6      273        27.36
Diana Roe            7      134        8.01
David C. Thompson    8      308        18.14
Matthew Wong         9      1          0.35