Title
Using Resource Use Data and System Logs for HPC System Error Propagation and Recovery Diagnosis
Abstract
Analyzing failures is important for the reliability of HPC systems, and failure diagnosis based only on system logs is incomplete. Resource use data, which has recently become available, is another potential source of data for failure analysis. Recent work that combines analysis of system logs with resource use data shows promising results. In this paper, we describe a new workflow for combining system resource usage data and failure logs for diagnosis. The workflow, called EXERMEST, identifies significant system counters and events, then correlates them with failures and recovery. We apply EXERMEST to log data from the Ranger HPC cluster and show that it improves diagnosis over previous research. EXERMEST: (i) shows that more system counters and errors can be identified only by applying more feature extractors, (ii) identifies CPU I/O bottlenecks and Lustre client eviction, (iii) identifies network packet drops and Lustre I/O errors, (iv) identifies virtual memory and hard disk I/O errors, and (v) shows that time bins of different granularities are required for identifying the errors. EXERMEST is available in the public domain to support system administrators in failure diagnosis.
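The correlation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the EXERMEST implementation: the function names, the use of Pearson correlation, and the sample data are all assumptions made for illustration. It bins a resource counter and error-log timestamps into fixed-width time bins and correlates the two series at two bin widths, to show how the result can depend on granularity.

```python
from collections import defaultdict
from math import sqrt


def bin_series(samples, bin_seconds, reduce):
    """Group (timestamp, value) pairs into fixed-width time bins."""
    bins = defaultdict(list)
    for ts, value in samples:
        bins[int(ts // bin_seconds)].append(value)
    return {b: reduce(vals) for b, vals in bins.items()}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlate(counter_samples, error_timestamps, bin_seconds):
    """Correlate a binned counter mean with the per-bin error count."""
    counter = bin_series(counter_samples, bin_seconds, lambda v: sum(v) / len(v))
    errors = bin_series([(ts, 1) for ts in error_timestamps], bin_seconds, sum)
    bins = sorted(counter)  # bins with no logged error count as zero
    xs = [counter[b] for b in bins]
    ys = [errors.get(b, 0) for b in bins]
    return pearson(xs, ys)


# Hypothetical data: a counter sampled every 60 s and three error log times.
counter_samples = [(t, 10.0) for t in range(0, 600, 60)]
counter_samples[5] = (300, 90.0)
counter_samples[6] = (360, 95.0)
error_timestamps = [310, 350, 370]

for width in (60, 300):
    print(width, round(correlate(counter_samples, error_timestamps, width), 3))
```

Running the loop with several bin widths gives one correlation value per granularity, which is one simple way to compare how strongly a counter tracks logged errors at each time scale.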
Year
2019
DOI
10.1109/ISPA-BDCloud-SustainCom-SocialCom48970.2019.00072
Venue
2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)
Keywords
HPC, Feature extraction, Correlation, Error propagation and recovery, Resource use data and system logs
DocType
Conference
ISBN
978-1-7281-4329-3
Citations
0
PageRank
0.34
References
0
Authors
7
Name                Order  Citations  PageRank
Edward Chuah        1      34         5.04
Arshad Jhumka       2      1          5.42
Samantha Alt        3      0          2.03
Juan J. Villalobos  4      0          2.03
Joshua Fryman       5      0          0.34
William Barth       6      0          0.34
Manish Parashar     7      3876       343.30