Monitoring strategies for scalable dynamic checkpointing - Citegraph

Paper Info

Title
Monitoring strategies for scalable dynamic checkpointing

Abstract
Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on production systems, as well as to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience runtime.
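
The pipeline the abstract describes (listen to hardware events, keep the important ones, forward them to the resilience runtime so it can adapt to a regime change) might be sketched roughly as follows. This is an illustration only, not the authors' implementation. The event names, severity weights, threshold, degraded-regime window, and the `ResilienceRuntime` class are all hypothetical. Young's first-order formula for the checkpoint interval is used as a stand-in, since the abstract does not say how the runtime chooses its interval.

```python
import math
from dataclasses import dataclass

# Hypothetical severity weights per event type (assumed, not from the paper).
SEVERITY = {
    "ecc_corrected": 0.2,
    "temperature_warning": 0.5,
    "ecc_uncorrected": 1.0,
}


@dataclass
class Event:
    node: str
    kind: str
    timestamp: float  # seconds since the start of the run


def is_important(event, threshold=0.5):
    """Filter step: keep events whose assumed severity meets the threshold."""
    return SEVERITY.get(event.kind, 0.0) >= threshold


def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation: sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


class ResilienceRuntime:
    """Hypothetical runtime that checkpoints more often in a degraded regime."""

    def __init__(self, checkpoint_cost, normal_mtbf, degraded_mtbf, window=3600.0):
        self.checkpoint_cost = checkpoint_cost
        self.normal_mtbf = normal_mtbf
        self.degraded_mtbf = degraded_mtbf
        self.window = window  # assumed duration of a high-failure-density period
        self.degraded_until = float("-inf")

    def notify(self, event):
        # A forwarded event starts (or extends) the degraded regime.
        self.degraded_until = max(self.degraded_until, event.timestamp + self.window)

    def interval(self, now):
        # Use the shorter MTBF while the degraded regime is active.
        degraded = now < self.degraded_until
        mtbf = self.degraded_mtbf if degraded else self.normal_mtbf
        return young_interval(self.checkpoint_cost, mtbf)


def monitor(events, runtime):
    """Forward only the important events to the runtime; return what was sent."""
    forwarded = [e for e in events if is_important(e)]
    for event in forwarded:
        runtime.notify(event)
    return forwarded


if __name__ == "__main__":
    runtime = ResilienceRuntime(checkpoint_cost=60.0, normal_mtbf=86400.0, degraded_mtbf=3600.0)
    events = [
        Event("node-12", "ecc_corrected", 50.0),
        Event("node-12", "ecc_uncorrected", 100.0),
    ]
    print("interval before:", runtime.interval(0.0))
    print("forwarded:", monitor(events, runtime))
    print("interval after:", runtime.interval(200.0))
```

In this sketch the corrected-ECC event falls below the assumed threshold and is dropped, while the uncorrected one is forwarded and shortens the checkpoint interval until the assumed window expires.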

Year	2016
DOI	10.1109/IGCC.2016.7892626
Venue	2016 Seventh International Green and Sustainable Computing Conference (IGSC)
Keywords	Supercomputers, Fault Tolerance, Resilience, Introspective Systems, Failures, High-Performance Computing
Field	Psychological resilience, Monitoring system, Supercomputer, Computer science, Fault tolerance, Computing systems, Distributed computing, Scalability
DocType	Conference
ISBN	978-1-5090-5118-2
Citations	0
PageRank	0.34
References	15
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
Swann Perarnau	1	72	6.84
Leonardo Bautista-Gomez	2	148	11.33
