Title
Improving workflow fault tolerance through provenance-based recovery
Abstract
Scientific workflow systems are frequently used to execute a variety of long-running computational pipelines that are prone to premature termination due to network failures, server outages, and other faults. Researchers have presented approaches for providing fault tolerance for portions of specific workflows, but no solution handles faults that terminate the workflow engine itself when executing a mix of stateless and stateful workflow components. Here we present a general framework for efficiently resuming workflow execution using information commonly captured by workflow systems to record data provenance. Our approach facilitates fast workflow replay using only such commonly recorded provenance data. We also propose a checkpoint extension to standard provenance models that significantly reduces the computation needed to reset the workflow to a consistent state, resulting in much shorter re-execution times. Our work generalizes the rescue-DAG approach used by DAGMan to richer workflow models that may contain stateless and stateful multi-invocation actors as well as workflow loops.
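The rescue-DAG idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's framework: it assumes a plain DAG of stateless tasks and treats a dictionary of recorded outputs as a stand-in for a provenance store, so stateful actors, multi-invocation actors, loops, and the checkpoint extension are all out of scope here. The names `resume`, `dag`, `tasks`, and `provenance` are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def resume(dag, tasks, provenance):
    """Re-run only the tasks that have no recorded output.

    dag        -- maps each task name to the set of its upstream task names
    tasks      -- maps each task name to a callable taking upstream outputs
    provenance -- maps completed task names to their recorded outputs
                  (assumed stand-in for a provenance store; updated in place)
    Returns the list of task names that were actually executed.
    """
    executed = []
    # static_order() yields every task after all of its upstream tasks.
    for name in TopologicalSorter(dag).static_order():
        if name in provenance:
            # Output was recorded before the failure, so skip re-execution.
            continue
        # Pass upstream outputs in a fixed (alphabetical) order of task name.
        inputs = [provenance[dep] for dep in sorted(dag.get(name, ()))]
        provenance[name] = tasks[name](*inputs)
        executed.append(name)
    return executed


# Hypothetical four-step pipeline that failed after "clean" completed.
dag = {"load": set(), "clean": {"load"}, "stats": {"clean"}, "plot": {"clean"}}
tasks = {
    "load": lambda: [3, 1, 2],
    "clean": lambda xs: sorted(xs),
    "stats": lambda xs: sum(xs),
    "plot": lambda xs: f"plot of {len(xs)} points",
}
recorded = {"load": [3, 1, 2], "clean": [1, 2, 3]}
rerun = resume(dag, tasks, recorded)
```

In this sketch only `stats` and `plot` are executed on recovery, since the outputs of `load` and `clean` are already recorded. Skipping a task on the basis of its recorded output is only safe when the task is stateless, which is why the abstract singles out stateful components and loops as the harder case.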
Year: 2011
DOI: 10.1007/978-3-642-22351-8_12
Venue: SSDBM
Keywords: provenance data, workflow system, workflow engine, data provenance, workflow loop, richer workflow model, workflow replay, scientific workflow system, provenance-based recovery, workflow execution, improving workflow fault tolerance, stateful workflow component
Field: Data mining, Workflow technology, Computer science, Fault tolerance, Stateful firewall, Workflow engine, Workflow management system, Workflow, Stateless protocol, Database, Distributed computing
DocType: Conference
Citations: 10
PageRank: 0.56
References: 18
Authors: 5
Name                 Order  Citations  PageRank
Sven Köhler          1      172        13.47
Sean Riddle          2      86         5.57
Daniel Zinn          3      198        13.43
Timothy McPhillips   4      262        14.14
Bertram Ludäscher    5      1879       239.67