Title
Linux Support for Fast Transparent General Purpose Checkpoint/Restart of Multithreaded Processes in Loadable Kernel Module
Abstract
Checkpoint/restart is the ability to save the state of a running application so that it can later resume execution from the point at which the checkpoint was taken. The technique has many potential applications, including building fault-tolerant environments, improving system resource utilization, and true process migration. As hardware speeds and cluster sizes increase, the average time between failures decreases, so fault tolerance and the ability to checkpoint a process have become essential. Almost all platforms deployed for high-performance computing support process checkpoint/restart, yet Linux, one of the most popular operating systems, does not provide a general-purpose implementation. Existing implementations are limited to a specific type of parallel programming library, confined to certain well-behaved applications, or reliant on specific kernel features that are often missing. Most also require recompiling the whole kernel to apply the required patches. In this paper, we describe the design and implementation of a multithreaded process checkpoint/restart system for Linux that can be extended dynamically to increase compatibility and reduce system overhead. It does not require any special facility in the operating system and can checkpoint and restart an application fully transparently, independent of its behavior. The entire system is implemented as multiple loadable kernel modules, which makes it easy to use and removes the burden of complex system administration.
Year: 2013
DOI: 10.1007/s10723-013-9248-5
Venue: J. Grid Comput.
Keywords: General purpose, Transparent, Multithreaded process checkpoint/restart, Loadable kernel module
Field: Kernel (linear algebra), Dynamic Extension, General purpose, Computer science, Usability, Real-time computing, Implementation, Fault tolerance, Loadable kernel module, Operating system, Distributed computing
DocType: Journal
Volume: 11
Issue: 2
ISSN: 1570-7873
Citations: 3
PageRank: 0.37
References: 22
Authors: 3
Name                    Order  Citations  PageRank
Amirreza Zarrabi        1      15         2.26
Khairulmizam Samsudin   2      92         13.43
Wan Azizun Wan Adnan    3      17         4.07