Title
Canaries in a Coal Mine: Using Application-Level Checkpoints to Detect Memory Failures
Abstract
Memory failures in future extreme-scale applications are a significant concern in the high-performance computing community and have attracted much research attention. We contend in this paper that using application checkpoint data to detect memory failures has potential benefits and is preferable to examining application memory. To support this contention, we describe the application of machine learning techniques to evaluate the veracity of checkpoint data. Our preliminary results indicate that supervised decision tree machine learning approaches can effectively detect corruption in restart files, suggesting that future extreme-scale applications and systems may benefit from incorporating such approaches in order to cope with memory failures.
Year: 2015
DOI: 10.1007/978-3-319-27308-2_54
Venue: Lecture Notes in Computer Science
Field: Extreme scale, Computer science, Coal mining, Decision tree learning, Corruption, Distributed computing
DocType: Conference
Volume: 9523
ISSN: 0302-9743
Citations: 2
PageRank: 0.39
References: 0
Authors: 4
Name             Order  Citations  PageRank
Patrick Widener  1      232        22.39
Kurt Ferreira    2      639        40.78
Scott Levy       3      27         3.09
Nathan Fabian    4      107        8.00