Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis. - Citegraph

Paper Info

Title
Towards comprehensive dependability-driven resource use and message log-analysis for HPC systems diagnosis.

Abstract
Failure analysis plays an important role in the reliability of data centers and high-performance computing (HPC) systems. Recent work has shown that resource use data and failure logs can, separately and together, be used to detect failure-inducing errors and to diagnose system failures, which result from error propagation and the unsuccessful execution of error recovery mechanisms. More accurate and detailed failure diagnosis requires knowledge of error propagation patterns and of unsuccessful error recovery; improving system reliability requires knowledge of the deployment of recovery protocols. This paper describes and demonstrates the application of a new diagnostics framework, CORRMEXT, which analyzes and reports error propagation patterns and the degrees of success and failure of error recovery protocols. The steps in the framework are correlation of resource use metrics with error messages and identification of the earliest times at which system behaviour changes. The framework is illustrated with analyses of resource use data and message logs from three HPC systems operated by the Texas Advanced Computing Center (TACC). The illustrations focus on groups of resource use counters and groups of errors; they reveal patterns of: (i) network data and software errors, (ii) Lustre file-system and Linux operating system process errors, and (iii) memory and storage errors. We also confirm that: (i) correlations of resource use and errors can only be identified by applying different correlation algorithms, and (ii) the earliest times of change in system behaviour can only be identified by analyzing both the correlated resource use counters and the correlated errors. We believe CORRMEXT is the first tool that has diagnosed error propagation paths and error recovery attempts on three different HPC systems. CORRMEXT will be released into the public domain in August 2018 to support systems administrators in diagnosing HPC system failures.
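
The two steps named in the abstract (correlating a resource use counter with error counts, then locating the earliest time of change) can be illustrated with a minimal sketch. This is not CORRMEXT's implementation: the abstract does not name its correlation algorithms, so Pearson and Spearman are assumptions chosen only to show two different algorithms side by side. The threshold rule in `earliest_change`, all function names, and the sample values are likewise hypothetical.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson (linear) correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman (rank) correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


def earliest_change(series, baseline_len, factor):
    """Hypothetical rule: first index after the baseline window whose value
    exceeds factor times the baseline mean, or None if there is none."""
    base = mean(series[:baseline_len])
    for t in range(baseline_len, len(series)):
        if series[t] > factor * base:
            return t
    return None


# Made-up samples, one value per time interval (not TACC data).
counter = [10, 11, 10, 12, 11, 30, 45, 60]  # a resource use counter
errors = [0, 0, 0, 0, 0, 2, 5, 9]           # error messages per interval

linear = pearson(counter, errors)
monotonic = spearman(counter, errors)
change_counter = earliest_change(counter, baseline_len=4, factor=2.0)
change_errors = earliest_change(errors, baseline_len=4, factor=2.0)
```

Computing both coefficients reflects the abstract's point that different correlation algorithms can surface different relationships, and checking the counter and the error series separately reflects its point that both are needed to find the earliest change.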

Year	DOI	Venue
2019	10.1016/j.jpdc.2019.05.013	Journal of Parallel and Distributed Computing
Keywords	Field	DocType
Large HPC systems,Correlation,Variance extraction,Error propagation and recovery,Cluster log-data	Dependability,Propagation of uncertainty,Software deployment,Public domain,Computer science,Support system,Software,Network data,Lustre (mineralogy),Distributed computing	Journal
Volume	ISSN	Citations
132	0743-7315	0
PageRank	References	Authors
0.34	0	6

Authors (6 rows)

Name	Order	Citations	PageRank
Edward Chuah	1	34	5.04
Arshad Jhumka	2	361	31.79
Samantha Alt	3	0	2.03
Daniel Balouek-Thomert	4	16	7.84
James C. Browne	5	998	300.57
Manish Parashar	6	3876	343.30
