Title
Online failure prediction for HPC resources using decentralized clustering
Abstract
Ensuring the reliability of large-scale clusters is becoming more critical as these machines continue to grow: larger systems have more numerous and more complex interactions between nodes, and therefore fail more frequently. For this reason, predicting node failures so that errors can be prevented before they occur has become extremely valuable. A common approach to failure prediction is to analyze traces of system events to find correlations between node failures and event types or anomalous event patterns, and then to use the identified types or patterns as failure predictors at run-time. However, typical centralized implementations of this approach suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute node soft-lockups in large-scale clusters that uses a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system using the monitoring logs from the Ranger supercomputer at the Texas Advanced Computing Center. Experiments show that this approach achieves accuracy similar to that of related approaches while maintaining low RAM and bandwidth usage, with a runtime impact on currently running applications of less than 2%.
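The general idea in the abstract, flagging nodes whose resource usage falls in sparsely populated regions while exchanging only compact summaries rather than raw logs, can be illustrated with a minimal sketch. This is not the DOC algorithm from the paper; it is a simplified grid-density stand-in. The function names, the two-dimensional (CPU, memory) samples normalized to [0, 1], the cell size, and the density threshold are all assumptions chosen for illustration.

```python
from collections import Counter

def cell_of(sample, cell_size=0.25):
    """Map a (cpu, mem) sample in [0, 1]^2 to a grid cell index (hypothetical layout)."""
    return tuple(int(min(v, 0.999999) / cell_size) for v in sample)

def local_histogram(samples, cell_size=0.25):
    """Per-node summary: how many local samples fall into each grid cell."""
    return Counter(cell_of(s, cell_size) for s in samples)

def merge(histograms):
    """Combine per-node summaries; only cell counts are exchanged, not raw samples."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total

def anomalous_nodes(samples_by_node, min_density=2, cell_size=0.25):
    """Flag nodes that have at least one sample in a globally sparse cell."""
    merged = merge(local_histogram(s, cell_size) for s in samples_by_node.values())
    flagged = set()
    for node, samples in samples_by_node.items():
        if any(merged[cell_of(s, cell_size)] < min_density for s in samples):
            flagged.add(node)
    return sorted(flagged)

# Made-up usage samples: node-c has one reading far from the dense region.
usage = {
    "node-a": [(0.20, 0.30), (0.22, 0.31)],
    "node-b": [(0.21, 0.28), (0.24, 0.33)],
    "node-c": [(0.23, 0.29), (0.95, 0.97)],
}
print(anomalous_nodes(usage))  # prints ['node-c']
```

In this sketch each node sends only a small histogram, which is one way to keep bandwidth low; a real deployment would also need to handle streaming updates and a decentralized merge, which are omitted here.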
Year: 2014
DOI: 10.1109/HiPC.2014.7116903
Venue: International Conference on High Performance Computing
Keywords: Failure prediction, Monitoring, Large-scale systems, HPC, Clustering
Field: Data mining, Supercomputer, Computer science, Parallel computing, Bandwidth (signal processing), Prediction algorithms, Distributed database, Cluster analysis, Overhead (business), Distributed computing
DocType: Conference
ISSN: 1094-7256
ISBN: 978-1-4799-5975-4
Citations: 4
PageRank: 0.40
References: 12
Authors: 5
Name
Order
Citations
PageRank
Alejandro Pelaez170.79
Andres Quiroz240.40
James C. Browne36221.52
Edward Chuah440.40
Manish Parashar53876343.30