Title
Approaches for Implementing Persistent Queues within Data-Intensive Scientific Workflows
Abstract
Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. An advantage of using dataflow models is their straightforward semantics (which includes support for branching, merging, and looping) and their ability to concurrently execute workflow steps. However, for many data-intensive workflows the dataflow model requires data buffering. Current systems largely perform buffering through in-memory queues, which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information "for free" by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how persistent queues allow this information both to be captured efficiently and to be used to improve overall workflow performance.
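The idea of a persistent queue backed by a relational store can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `PersistentQueue`, the `tokens` table, and its columns are hypothetical, and SQLite merely stands in for the "traditional relational database" option. Consumed items are flagged rather than deleted, which is one simple way the buffered inputs and outputs could double as a provenance record.

```python
import sqlite3


class PersistentQueue:
    """FIFO buffer between two workflow steps, stored in a SQLite table.

    Hypothetical sketch: table and column names are assumptions, not
    taken from the paper.
    """

    def __init__(self, path=":memory:", name="edge"):
        self.conn = sqlite3.connect(path)
        self.name = name
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "CREATE TABLE IF NOT EXISTS tokens ("
                " id INTEGER PRIMARY KEY AUTOINCREMENT,"
                " queue TEXT NOT NULL,"
                " payload TEXT NOT NULL,"
                " consumed INTEGER NOT NULL DEFAULT 0)"
            )

    def put(self, payload):
        """Append one data item; it lives on disk rather than in memory."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO tokens (queue, payload) VALUES (?, ?)",
                (self.name, payload),
            )

    def get(self):
        """Return the oldest unconsumed item, or None if the queue is empty."""
        with self.conn:
            row = self.conn.execute(
                "SELECT id, payload FROM tokens"
                " WHERE queue = ? AND consumed = 0 ORDER BY id LIMIT 1",
                (self.name,),
            ).fetchone()
            if row is None:
                return None
            # Flag instead of delete so the item remains available as provenance.
            self.conn.execute(
                "UPDATE tokens SET consumed = 1 WHERE id = ?", (row[0],)
            )
            return row[1]

    def history(self):
        """All items ever enqueued, in order, with their consumed status."""
        rows = self.conn.execute(
            "SELECT payload, consumed FROM tokens WHERE queue = ? ORDER BY id",
            (self.name,),
        ).fetchall()
        return [(payload, bool(consumed)) for payload, consumed in rows]


q = PersistentQueue()
q.put("record-1")
q.put("record-2")
first = q.get()
```

In this sketch the upstream step calls `put`, the downstream step calls `get`, and `history` exposes everything that passed through the edge. A real engine would also need blocking reads and concurrent access control, which are omitted here.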
Year
2011
DOI
10.1109/SERVICES.2011.57
Venue
SERVICES
Keywords
implementing persistent queues, workflow step, data-intensive scientific workflows, workflow component, output information, scientific workflow system, overall workflow performance, persistent queue, provenance information, detailed provenance information, workflow execution, dataflow model, computer model, queueing theory, data flow analysis, relational databases, relational database, buffer overflow, schedules, pipelines, computational modeling, dataflow, parallel processing
Field
Workflow technology, Relational database, Computer science, Dataflow, Model of computation, External storage, Workflow engine, Workflow, Workflow management system, Database, Distributed computing
DocType
Conference
Citations
1
PageRank
0.35
References
14
Authors
2
Name | Order | Citations | PageRank
Michael Agun | 1 | 7 | 1.17
Shawn Bowers | 2 | 1223 | 86.44