Abstract |
---|
With the advent of grid computing, more and more high-end computational resources are becoming available to scientists. While this opens up new avenues for scientific research, it makes ensuring the reliability and fault tolerance of such a system a non-trivial task, especially for long-running distributed applications. To address this problem, we present a distributed user-defined checkpointing mechanism within the XCAT3 system. XCAT3 is a framework for Common Component Architecture (CCA) based components that is consistent with current Grid standards. We describe in detail the algorithms and APIs added to XCAT3 to support distributed checkpointing. Our approach ensures that the checkpoints are platform independent, minimal in size, and always available during component failures. In addition, our algorithms maintain correctness in the presence of failures and scale well with the number of components and the checkpoint size. |
Year | DOI | Venue |
---|---|---|
2004 | 10.1109/GRID.2004.15 | GRID |
Keywords | Field | DocType |
Internet,application program interfaces,checkpointing,fault tolerance,grid computing,API,XCAT3 system,checkpoint algorithm,common component architecture,distributed application,distributed component,distributed user-defined checkpointing mechanism,grid standard,reliability,restart algorithm,scientific research,Components,Distributed Checkpointing,Grids,Web Services | Architecture,Grid computing,Computer science,Correctness,Real-time computing,Fault tolerance,Web service,Grid,The Internet,Distributed computing | Conference |
ISSN | ISBN | Citations |
1550-5510 | 0-7695-2256-4 | 7 |
PageRank | References | Authors |
0.50 | 9 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sriram Krishnan | 1 | 448 | 49.29 |
Dennis Gannon | 2 | 2514 | 330.26 |