Title
Establishing Hypothesis for Recurrent System Failures from Cluster Log Files
Abstract
A goal of supercomputer log analysis is to establish causal relationships among events that reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps that systems administrators currently perform to diagnose system failures. However, supercomputer logs are unstructured, incomplete, and highly ambiguous, which makes direct discovery of causal relationships difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of a failure, the components involved, and the event sequences that contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events, and from these correlations determines the components involved and the event sequences. The diagnostic capabilities of FDiag are validated by checking its assessments against known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostic capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use to support failure diagnosis on Ranger in September 2011.
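The extract-then-correlate flow described in the abstract can be illustrated with a minimal sketch. This is not FDiag's implementation: the log line format (`<epoch-seconds> <node> <message>`), the two event patterns, the fixed time window, and the use of a Pearson correlation over per-window event counts are all assumptions chosen for illustration, and every function name below is hypothetical.

```python
import re
from math import sqrt

# Assumed log format (hypothetical): "<epoch-seconds> <node> <message>"
LINE_RE = re.compile(r"^(\d+)\s+(\S+)\s+(.*)$")

# Hypothetical event patterns, not FDiag's actual extraction rules.
PATTERNS = {
    "lustre_error": re.compile(r"LustreError", re.I),
    "node_evicted": re.compile(r"evict", re.I),
}


def extract_events(lines):
    """Return (timestamp, node, event_type) for each line matching a pattern."""
    events = []
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        ts, node, msg = int(m.group(1)), m.group(2), m.group(3)
        for name, pat in PATTERNS.items():
            if pat.search(msg):
                events.append((ts, node, name))
    return events


def count_series(events, event_type, start, end, window):
    """Count events of one type in consecutive windows covering [start, end)."""
    n_windows = (end - start + window - 1) // window
    counts = [0] * n_windows
    for ts, _node, name in events:
        if name == event_type and start <= ts < end:
            counts[(ts - start) // window] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)
```

A possible usage: call `extract_events` on the raw lines, build one `count_series` per event type over the same time range and window, then pass pairs of series to `pearson`. A high correlation only suggests that two event types tend to occur in the same windows; it does not by itself establish which one is causal.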
Year
2011
DOI
10.1109/DASC.2011.27
Venue
DASC
Keywords
lustre file system, causal relationship, recurrent system failures, log-based failure diagnostics framework, diagnostics capability, generation fdiag, failure diagnosis, cluster log files, manual failure diagnosis process, supercomputer log, system log, event sequence, syslogs, servers, hypothesis testing, reliability, hypothesis test, system monitoring, correlation, protocols, data mining
Field
Data mining, Supercomputer, Computer science, Server, System monitoring, Automation, Lustre (file system), Ambiguity, Statistical hypothesis testing, Correlation analysis
DocType
Conference
Citations
6
PageRank
0.72
References
19
Authors
8
Name                   Order  Citations  PageRank
Edward Chuah           1      34         5.04
Gary Lee               2      39         4.74
William-Chandra Tjhi   3      156        10.09
Shyh-Hao Kuo           4      58         5.84
Terence Hung           5      18         1.67
John Hammond           6      51         5.07
Tommy Minyard          7      35         4.44
James C. Browne        8      998        300.57