Title
Distem: Evaluation of Fault Tolerance and Load Balancing Strategies in Real HPC Runtimes through Emulation
Abstract
The era of Exascale computing raises new challenges for HPC. Intrinsic characteristics of those extreme scale platforms bring energy and reliability issues. To cope with those constraints, applications will have to be more flexible in order to deal with platform geometry evolutions and unavoidable failures. Thus, to prepare for this upcoming era, a strong effort must be made on improving the HPC software stack. This work focuses on improving the study of a central part of the software stack, the HPC runtimes. To this end we propose a set of extensions to the Distem emulator that enable the evaluation of fault tolerance and load balancing mechanisms in such runtimes. Extensive experimentation showing the benefits of our approach has been performed with three HPC runtimes: Charm++, MPICH, and OpenMPI.
Year
2016
DOI
10.1109/CCGrid.2016.35
Venue
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Keywords
Experimentation, HPC runtimes, Fault tolerance, Load balancing, Emulation
Field
Exascale computing, MPICH, Extreme scale, Load balancing (computing), Computer science, Software fault tolerance, Real-time computing, Software, Fault tolerance, Emulation, Distributed computing
DocType
Conference
ISSN
2376-4414
ISBN
978-1-5090-2454-4
Citations
0
PageRank
0.34
References
13
Authors
4
Name | Order | Citations | PageRank
Cristian Ruiz | 1 | 30 | 2.75
Joseph Emeras | 2 | 18 | 2.94
Emmanuel Jeanvoine | 3 | 83 | 6.75
Lucas Nussbaum | 4 | 145 | 15.18