Title
FaultSight: A Fault Analysis Tool for HPC Researchers
Abstract
System reliability is expected to be a significant challenge for future extreme-scale systems. Poor reliability results in a higher frequency of interruptions in high-performance computing (HPC) applications due to system/application crashes or data corruption caused by soft errors. In response, application-level error detection and recovery schemes are devised to mitigate the impact of these interruptions. Evaluating these schemes and the reliability of an application requires the analysis of thousands of fault injection trials, resulting in a tedious and time-consuming process. Furthermore, there is no one data analysis tool that can work with all of the fault injection frameworks currently in use. In this paper, we present FaultSight, a fault injection analysis tool capable of efficiently assisting in the analysis of HPC application reliability as well as the effectiveness of resiliency schemes. FaultSight is designed to be flexible and work with data coming from a variety of fault injection frameworks. The effectiveness of FaultSight is demonstrated by exploring the reliability of different versions of the Matrix-Matrix Multiplication kernel using two different fault injection tools. In addition, the detection and recovery schemes are highlighted for the HPCCG mini-app.
Year
2019
DOI
10.1109/FTXS49593.2019.00008
Venue
2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Keywords
soft-error-analysis, fault-analysis-tool, fault-tolerance, resiliency, fault-injection, fault-analysis
Field
Kernel (linear algebra), Fault analysis, Computer science, Error detection and correction, Multiplication, Fault tolerance, Data corruption, Reliability engineering, Fault injection
DocType
Conference
ISBN
978-1-7281-6014-6
Citations
0
PageRank
0.34
References
0
Authors
4
Name | Order | Citations | PageRank
Einar Horn | 1 | 0 | 0.34
Dakota Fulp | 2 | 0 | 0.34
Jon Calhoun | 3 | 0 | 0.34
Luke Olson | 4 | 235 | 21.93