An analysis of clustered failures on large supercomputing systems - Citegraph

Paper Info

Title
An analysis of clustered failures on large supercomputing systems

Abstract
Large supercomputers are built today using thousands of commodity components, and suffer from poor reliability due to frequent component failures. The characteristics of failure observed on large-scale systems differ from those of smaller-scale systems studied in the past. One striking difference is that system events are clustered temporally and spatially, which complicates failure analysis and application design. Developing a clear understanding of failures for large-scale systems is a critical step in building more reliable systems and applications that can better tolerate and recover from failures. In this paper, we analyze the event logs of two large IBM Blue Gene systems, statistically characterize system failures, present a model for predicting the probability of node failure, and assess the effects of differing rates of failure on job failures for large-scale systems. The work presented in this paper will be useful for developers and designers seeking to deploy efficient and reliable petascale systems.

Year	DOI	Venue
2009	10.1016/j.jpdc.2009.03.007	J. Parallel Distrib. Comput.
Keywords	Field	DocType
reliable petascale system,semi-markov process,large supercomputing system,frequent component failure,node failure,large ibm blue gene,reliable system,job failure,supercomputer,failure analysis,system failure,large supercomputers,large-scale system,reliability modeling	IBM,Markov process,Supercomputer,Computer science,Blue gene,Failure rate,Fault tolerance,Petascale computing,Reactive system,Distributed computing	Journal
Volume	Issue	Journal
69	7	Journal of Parallel and Distributed Computing
Citations	PageRank	References
32	1.39	21
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Thomas J. Hacker	1	338	32.29
Fabian Romero	2	35	2.14
Christopher D. Carothers	3	1022	61.60
