Using replication and checkpointing for reliable task management in computational Grids - Citegraph

Paper Info

Title
Using replication and checkpointing for reliable task management in computational Grids

Abstract
In large-scale Grid computing environments, fault tolerance is required to make both scientific computation and file sharing reliable. Previous work has proposed several fault-tolerance mechanisms for Grids and other distributed computing systems. However, some of them use only space redundancy (hardware replication), while others use only time redundancy (checkpointing and rollback). As a result, the existing mechanisms use Grid resources inefficiently. The main goal of the proposed mechanism, ART, is to reduce the number of replicas by applying a checkpointing and rollback scheme to each replica. In ART, the minimum number of replicas is selected adaptively for each task, based on an analysis of the probability of successful execution within the given deadline and on the task's reliability requirement. Our simulation results show that ART significantly reduces the number of replicas and improves scalability compared with existing mechanisms.
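
The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's probability analysis, so the model below is an assumption: each replica is taken to finish within the deadline independently with the same probability `p_success` (a value that would already account for checkpointing and rollback), and the function name `min_replicas` and the cap `max_replicas` are hypothetical.

```python
from typing import Optional


def min_replicas(p_success: float, reliability: float,
                 max_replicas: int = 64) -> Optional[int]:
    """Return the smallest replica count n that meets the reliability target.

    Assumption (not taken from the paper): replicas succeed independently,
    each with probability p_success of finishing within the deadline, so the
    task succeeds with probability 1 - (1 - p_success) ** n.
    Returns None if the target cannot be met with max_replicas replicas.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be in [0, 1]")

    for n in range(1, max_replicas + 1):
        # Probability that at least one of the n replicas succeeds.
        p_task = 1.0 - (1.0 - p_success) ** n
        if p_task >= reliability:
            return n
    return None


if __name__ == "__main__":
    # A replica that succeeds half the time needs 4 copies to reach 90%.
    print(min_replicas(0.5, 0.9))
```

Under this assumed model, raising `p_success` (for example through more effective checkpointing) lowers the number of replicas needed for the same reliability target, which is the trade-off the abstract describes.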

Year: 2010
DOI: 10.1109/HPCS.2010.5547140
Venue: HPCS
Keywords: replication, real-time, checkpointing, computational grids, reliability, density functional theory, grid computing, probabilistic logic, real time, software fault tolerance, file sharing, fault tolerant, resource utilization, redundancy, scientific computing, fault tolerance
Field: Grid computing, Task management, Computer science, Parallel computing, Software fault tolerance, Redundancy (engineering), Fault tolerance, Probabilistic logic, Rollback, Distributed computing, Scalability
DocType: Conference
ISBN: 978-1-4244-6827-0
Citations: 1
PageRank: 0.36
References: 12
Authors: 5

Authors (5 rows)

Name	Order	Citations	PageRank
Sangho Yi	1	538	35.84
Derrick Kondo	2	1541	82.99
Bongjae Kim	3	15	7.10
Geunyoung Park	4	26	3.74
Yookun Cho	5	1544	162.03
