Paper Info

Title
PARIS: Predicting application resilience using machine learning

Abstract
The traditional method to study application resilience to errors in HPC applications uses fault injection (FI), a time-consuming approach. While analytical models have been built to overcome the inefficiencies of FI, they lack accuracy. In this paper, we present PARIS, a machine-learning method to predict application resilience that avoids the time-consuming process of random FI and provides higher prediction accuracy than analytical models. PARIS captures the implicit relationship between application characteristics and application resilience, which is difficult to capture using most analytical models. We overcome many technical challenges for feature construction, extraction, and selection to use machine learning in our prediction approach. Our evaluation on 16 HPC benchmarks shows that PARIS achieves high prediction accuracy. PARIS is up to 450x faster than random FI (49x on average). Compared to the state-of-the-art analytical model, PARIS is at least 63% better in terms of accuracy and has comparable execution time on average.

Field	Value
Year	2018
DOI	10.1016/j.jpdc.2021.02.015
Venue	Journal of Parallel and Distributed Computing
Keywords	HPC fault tolerance, Application resilience prediction, Transient faults, Fault injection, Silent data corruption
DocType	Journal
Volume	152
ISSN	0743-7315
Citations	1
PageRank	0.35
References	0
Authors	3

Authors

Name	Order	Citations	PageRank
Luanzheng Guo	1	10	1.82
Dong Li	2	764	48.56
Ignacio Laguna	3	239	24.56