Title
A fault-tolerance architecture for Kepler-based distributed scientific workflows
Abstract
Fault tolerance and failure recovery in scientific workflows are still relatively young topics. The work done in the domain so far mostly applies classic fault-tolerance mechanisms, such as "alternative versions" and "checkpointing", to scientific workflows. Scientific workflow systems often simply rely on the fault-tolerance capabilities provided by their third-party subcomponents, such as schedulers, Grid resources, or the underlying operating systems. When failures occur at the underlying layers, a workflow system typically sees them only as failed steps in the process, without additional detail, and its ability to recover from those failures may be limited. In this paper, we present an architecture that tries to address this for Kepler-based scientific workflows by providing more information about the failures and faults we have observed, and through a supporting implementation that offers more comprehensive failure coverage and recovery options. We discuss our framework in the context of the failures observed in two production-level Kepler-based workflows, XGC and S3D. The framework is divided into three major components: (i) a general contingency Kepler actor that provides recovery-block functionality at the workflow level, (ii) an external monitoring module that tracks the underlying workflow components and monitors the overall health of the workflow execution, and (iii) a checkpointing mechanism that provides smart resume capabilities for cases in which an unrecoverable error occurs. The framework takes advantage of the provenance data collected by the Kepler-based workflows to detect failures and to support fault-tolerance decision making.
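The recovery-block idea behind component (i) can be illustrated with a minimal sketch. This is not the paper's contingency actor or any Kepler API; the names `recovery_block`, `acceptance_test`, and `AllAlternativesFailed` are hypothetical and chosen only for illustration. It assumes the classic pattern: try a primary routine, check its result with an acceptance test, and fall back to alternatives when the routine fails or the result is rejected.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


class AllAlternativesFailed(Exception):
    """Raised when every alternative raised an error or was rejected."""


def recovery_block(
    alternatives: Iterable[Callable[[], T]],
    acceptance_test: Callable[[T], bool],
    retries_per_alternative: int = 1,
) -> T:
    """Run alternatives in order and return the first accepted result.

    Hypothetical helper: the first callable plays the role of the primary
    version, the remaining ones are the fallbacks.
    """
    failures: List[Tuple[str, int, str]] = []
    for alternative in alternatives:
        name = getattr(alternative, "__name__", repr(alternative))
        for attempt in range(retries_per_alternative):
            try:
                result = alternative()
            except Exception as exc:  # a failed step: record it and move on
                failures.append((name, attempt, repr(exc)))
                continue
            if acceptance_test(result):
                return result
            failures.append((name, attempt, "rejected by acceptance test"))
    # Nothing succeeded: report every recorded failure to the caller.
    raise AllAlternativesFailed(failures)


# Usage: the primary step fails, so the fallback result is returned.
def primary_transfer() -> str:
    raise IOError("remote host unreachable")


def fallback_transfer() -> str:
    return "copied via fallback route"


outcome = recovery_block(
    [primary_transfer, fallback_transfer],
    acceptance_test=lambda value: isinstance(value, str),
)
```

When every alternative fails, the sketch raises an exception that carries the recorded failures; that is roughly the point at which a checkpoint-and-resume mechanism such as component (iii) would take over.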
Year
2010
DOI
10.1007/978-3-642-13818-8_31
Venue
SSDBM
Keywords
underlying workflow component, kepler-based workflows, scientific workflows, workflow level, workflow system, fault-tolerance architecture, production-level kepler-based workflows, classic fault-tolerance mechanism, scientific workflow system, kepler-based scientific workflows, workflow execution, fault tolerance, distributed computing, data collection, fault tolerant, operating system
Field
Data mining, Architecture, Workflow technology, Computer science, Fault tolerance, Kepler, Workflow engine, Workflow management system, Workflow, Contingency, Database, Distributed computing
DocType
Conference
Volume
6187
ISSN
0302-9743
ISBN
3-642-13817-9
Citations
10
PageRank
0.60
References
13
Authors
5
Name             Order  Citations  PageRank
Pierre Mouallem  1      26         3.46
Daniel Crawl     2      243        21.02
Ilkay Altintas   3      1191       106.09
M. A. Vouk       4      732        59.03
Ustun Yildiz     5      86         6.76