Title
A Message Logging Protocol Based on User Level Failure Mitigation
Abstract
Fault tolerance and its associated overheads are of great concern for current high-performance computing systems and future exascale systems. In such systems, message logging is an important transparent rollback-recovery technique because it avoids a global restoration process. Most previous work designed and implemented message logging at the library level or at an even lower level of the software hierarchy. In this paper, we propose a new message logging protocol that elevates payload copying, failure handling, and the recovery procedure to the user level, in order to handle sender-based logging for collective operations better and to guarantee a certain level of portability. The proposed approach does not record collective communications as a set of point-to-point messages in the MPI library; instead, we preserve the application data related to the communications to ensure that there exists a process which can serve the original result in case of failure. We implement our protocol in Open MPI and evaluate it with the NPB benchmarks on a subsystem of Tianhe-1A. Experimental results show an improvement in failure-free performance and a reduction in recovery time.
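The general idea of sender-based message logging mentioned in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the protocol proposed in the paper: it uses no MPI, and the `Process` class, its methods, and `demo` are hypothetical names introduced only for illustration. Each sender keeps a copy of every payload it sends, so a restarted receiver can be rebuilt by replay without rolling back the other processes.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy process with a sender-side payload log (hypothetical, not MPI)."""
    rank: int
    state: int = 0                                 # running sum of received payloads
    sent_log: dict = field(default_factory=dict)   # (dest rank, seq) -> payload
    next_seq: dict = field(default_factory=dict)   # dest rank -> next sequence number

    def send(self, dest: "Process", payload: int) -> None:
        seq = self.next_seq.get(dest.rank, 0)
        self.next_seq[dest.rank] = seq + 1
        self.sent_log[(dest.rank, seq)] = payload  # keep a copy on the sender
        dest.receive(payload)

    def receive(self, payload: int) -> None:
        self.state += payload

    def replay_to(self, dest: "Process") -> None:
        """Re-deliver logged payloads to a restarted process, in send order."""
        keys = sorted(k for k in self.sent_log if k[0] == dest.rank)
        for key in keys:
            dest.receive(self.sent_log[key])


def demo() -> tuple:
    sender, receiver = Process(0), Process(1)
    for value in (3, 4, 5):
        sender.send(receiver, value)
    before = receiver.state
    receiver = Process(1)        # simulate a failure: the receiver's state is lost
    sender.replay_to(receiver)   # only the failed process is rebuilt
    return before, receiver.state
```

Calling `demo()` returns the receiver's state before the simulated failure and after replay; the two values match because the sender's log holds every payload in send order.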
Year
2013
DOI
10.1007/978-3-319-03859-9_27
Venue
ICA3PP
Keywords
rollback-recovery, checkpointing, user level, message logging, fault tolerance
DocType
Conference
Volume
8285 LNCS
Issue
PART 1
Citations
2
PageRank
0.38
References
12
Authors
5
Name             Order   Citations   PageRank
Xunyun Liu       1       2           0.38
Xinhai Xu        2       22          7.73
Ren Xiaoguang    3       40          4.17
Yuhua Tang       4       2           1.05
Ziqing Dai       5       2           0.38