Title
Using Performance Tools to Support Experiments in HPC Resilience.
Abstract
The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between "performance tools" and "resilience tools". As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community. In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnosis procedures help to provide context for the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.
Year
2013
DOI
10.1007/978-3-642-54420-0_71
Venue
Lecture Notes in Computer Science
Field
Psychological resilience, Parallels, Supercomputer, Computer science, Parallel computing, Message Passing Interface, Fault tolerance, Message passing, Distributed computing
DocType
Conference
Volume
8374
ISSN
0302-9743
Citations
0
PageRank
0.34
References
7
Authors
4
Name                 Order  Citations  PageRank
Thomas Naughton      1      76         8.79
Swen Böhm            2      20         1.51
Christian Engelmann  3      953        60.46
Geoffroy Vallée      4      123        15.62