Title
CRUDE: Combining Resource Usage Data and Error Logs for Accurate Error Detection in Large-Scale Distributed Systems.
Abstract
The use of console logs for error detection in large-scale distributed systems has proven useful to system administrators. However, such logs are typically redundant and incomplete, making accurate detection very difficult. To increase this accuracy, we complement these incomplete console logs with resource usage data, which captures the resource utilisation of every job in the system. We then develop a novel error detection methodology, the CRUDE approach, that makes use of both the resource usage data and console logs. We make the following specific technical contributions: we develop (i) a clustering algorithm to group nodes with similar behaviour, (ii) an anomaly detection algorithm to identify jobs with anomalous resource usage, and (iii) an algorithm that links jobs with anomalous resource usage to erroneous nodes. We then evaluate our approach using console logs and resource usage data from the Ranger Supercomputer. Our results are positive: (i) our approach detects errors with a true positive rate of about 80%, and (ii) compared with the well-known Nodeinfo error detection algorithm, our algorithm provides an average improvement of around 85%, with a best-case improvement of 250%.
Year
2016
DOI
10.1109/SRDS.2016.15
Venue
Symposium on Reliable Distributed Systems Proceedings
Keywords
anomaly detection, resource usage data, faults, detection, large-scale HPC systems, unsupervised, event logs
Field
Data mining, Anomaly detection, Supercomputer, Computer science, Error detection and correction, Usage data, Cluster analysis, True positive rate, Distributed computing
DocType
Conference
ISSN
1060-9857
Citations
0
PageRank
0.34
References
0
Authors
5
Name                 Order   Citations   PageRank
Nentawe Gurumdimma   1       3           1.40
Arshad Jhumka        2       361         31.79
Maria Liakata        3       375         30.40
Edward Chuah         4       34          5.04
James C. Browne      5       998         300.57