Title
PARIS: Predicting application resilience using machine learning
Abstract
The traditional method to study application resilience to errors in HPC applications uses fault injection (FI), a time-consuming approach. While analytical models have been built to overcome the inefficiencies of FI, they lack accuracy. In this paper, we present PARIS, a machine-learning method to predict application resilience that avoids the time-consuming process of random FI and provides higher prediction accuracy than analytical models. PARIS captures the implicit relationship between application characteristics and application resilience, which is difficult to capture using most analytical models. We overcome many technical challenges for feature construction, extraction, and selection to use machine learning in our prediction approach. Our evaluation on 16 HPC benchmarks shows that PARIS achieves high prediction accuracy. PARIS is up to 450x faster than random FI (49x on average). Compared to the state-of-the-art analytical model, PARIS is at least 63% better in terms of accuracy and has comparable execution time on average.
Year: 2018
DOI: 10.1016/j.jpdc.2021.02.015
Venue: Journal of Parallel and Distributed Computing
Keywords: HPC fault tolerance, Application resilience prediction, Transient faults, Fault injection, Silent data corruption
DocType: Journal
Volume: 152
ISSN: 0743-7315
Citations: 1
PageRank: 0.35
References: 0
Authors: 3
Name            Order  Citations  PageRank
Luanzheng Guo   1      10         1.82
Li, Dong        2      764        48.56
Ignacio Laguna  3      239        24.56