Title
FTI: high performance fault tolerance interface for hybrid systems
Abstract
Large scientific applications deployed on current petascale systems expend a significant amount of their execution time dumping checkpoint files to remote storage. New fault tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead high-frequency multi-level checkpoint technique in which we integrate a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme. We efficiently hide the encoding time using one Fault-Tolerance dedicated thread per node. We implement our technique in the Fault Tolerance Interface FTI. We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw9.0 Tohoku Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 petaflops runs (1152 GPUs) while checkpointing at high frequency.
Year
2011
DOI
10.1145/2063384.2063427
Venue
SC
Keywords
three-level checkpoint scheme, low-overhead high-frequency multi-level checkpoint, checkpoint file, execution time, fault tolerance interface fti, hybrid system, new fault tolerant technique, checkpoint overhead, high performance fault tolerance, encoding time, performance model, case study, fault tolerance, user interfaces, writing, fault tolerant, fault tolerant system, computer model, compression, reed solomon, computational modeling, topology, data reduction, encoding, high frequency, data intensive computing
Field
Earthquake simulation, Data-intensive computing, Computer science, Correctness, Parallel computing, Fault tolerance, User interface, Petascale computing, Hybrid system, Embedded system, Distributed computing, Encoding (memory)
DocType
Conference
Citations
115
PageRank
3.64
References
27
Authors
6
Name                      Order  Citations  PageRank
Leonardo Bautista-Gomez   1      148        11.33
Seiji Tsuboi              2      115        3.64
Dimitri Komatitsch        3      339        22.87
Franck Cappello           4      3775       251.47
Naoya Maruyama            5      836        55.34
Satoshi Matsuoka          6      3773       359.36