On the Feasibility of Incremental Checkpointing for Scientific Computing.

Paper Info

Title
On the Feasibility of Incremental Checkpointing for Scientific Computing.

Abstract
In the near future, large-scale parallel computers will feature hundreds of thousands of processing nodes. In such systems, fault tolerance is critical, as failures will occur very often. Checkpointing and rollback recovery have been extensively studied as an attempt to provide fault tolerance. However, current implementations do not provide the total transparency and full flexibility that are necessary to support the new paradigm of autonomic computing: systems able to self-heal and self-repair. In this paper we provide an in-depth evaluation of incremental checkpointing for scientific computing. The experimental results, obtained on a state-of-the-art cluster running several scientific applications, show that efficient, scalable, automatic and user-transparent incremental checkpointing is within reach with current technology.

Year	DOI	Venue
2004	10.1109/IPDPS.2004.1302982	IPDPS
Keywords	Field	DocType
parallel computer,scientific computing,fault tolerance,hardware,high performance computing,computer networks,concurrent computing,application software,autonomic computing,fault tolerant	Transparency (graphic),Autonomic computing,System recovery,Computer science,Parallel computing,Implementation,Computational science,Fault tolerance,Rollback recovery,Distributed computing,Scalability	Conference
Citations	PageRank	References
25	1.64	18
Authors
5

Authors (5 rows)

Name	Order	Citations	PageRank
José Carlos Sancho	1	382	29.97
Fabrizio Petrini	2	2050	165.82
Greg Johnson	3	25	1.64
Juan Fernandez	4	269	23.17
Eitan Frachtenberg	5	1060	85.08
