Title
Design and evaluation of a self-healing Kepler for scientific workflows
Abstract
Kepler is a popular open source scientific workflow (SWF) system, as it simplifies the effort required to construct complex data flow models through a visual interface. As the complexity of the workflow applications that run on heterogeneous distributed systems increases, fault management becomes a critical design issue for large-scale scientific and engineering applications. Because these applications have long execution times, it is important that they be fault tolerant; i.e., the workflow application can recover gracefully from faults without having to restart from the beginning. The current implementation of the Kepler tool does not support fault tolerance or recovery mechanisms. In this paper, we extend Kepler's capabilities to support fault-tolerant scientific workflows (FT-SWF) with a checkpoint mechanism in which corrective measures are taken seamlessly and autonomically whenever a fault is detected. To the best of our knowledge, this is the first approach to adding autonomic operations to Kepler. We have evaluated FT-Kepler on a distributed application used by ecosystem researchers. We evaluated the performance of the workflow under hardware- and software-based fault scenarios in terms of execution time, recovery time, and checkpoint mechanism overhead. The experimental evaluations indicate that the checkpoint mechanism adds negligible overhead to the total execution time of the workflow and that, as the fault rate increases, the number of checkpoints should be increased.
Year: 2010
DOI: 10.1145/1851476.1851525
Venue: HPDC
Keywords: scientific workflows, fault rate increase, workflow application, kepler capability, fault scenario, scientific workflow, self-healing kepler, fault tolerant, fault tolerant scientific workflow, fault tolerance, fault management, checkpoint mechanism, distributed application, complex data, kepler
Field: Workflow technology, Computer science, Real-time computing, Fault management, Fault tolerance, Software, Workflow application, Workflow engine, Workflow, Workflow management system, Distributed computing
DocType: Conference
Citations: 1
PageRank: 0.37
References: 5
Authors
5
Name              Order  Citations  PageRank
Arjun Hary        1      1          0.37
Ali Akoglu        2      157        29.40
Youssif Alnashif  3      88         7.17
Salim Hariri      4      2593       184.23
Darrel Jenerette  5      1          0.37