Title
An analysis of clustered failures on large supercomputing systems
Abstract
Large supercomputers are built today using thousands of commodity components, and suffer from poor reliability due to frequent component failures. The characteristics of failure observed on large-scale systems differ from smaller scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures for large-scale systems is a critical step in building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing rates of failure on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.
Year
2009
DOI
10.1016/j.jpdc.2009.03.007
Venue
J. Parallel Distrib. Comput.
Keywords
reliable petascale system, semi-markov process, large supercomputing system, frequent component failure, node failure, large ibm blue gene, reliable system, job failure, supercomputer, failure analysis, system failure, large supercomputers, large-scale system, reliability modeling
Field
IBM, Markov process, Supercomputer, Computer science, Blue gene, Failure rate, Fault tolerance, Petascale computing, Reactive system, Distributed computing
DocType
Journal
Volume
69
Issue
7
ISSN
Journal of Parallel and Distributed Computing
Citations
32
PageRank
1.39
References
21
Authors
3
Name, Order, Citations, PageRank
Thomas J. Hacker, 1, 338, 32.29
Fabian Romero, 2, 35, 2.14
Christopher D. Carothers, 3, 1022, 61.60