Title
Low Latency and Resource-Aware Program Composition for Large-Scale Data Analysis
Abstract
Large-scale data analysis has recently grown in importance in a wide variety of areas, such as natural language processing, sensor data analysis, and scientific computing. Such an analysis application typically reuses existing programs as components and is often required to continuously process new data with low latency while processing large-scale data on distributed computation nodes. However, existing frameworks for combining programs into a parallel data analysis pipeline (e.g., workflow) are plagued by the following issues: (1) Most frameworks are oriented toward high-throughput batch processing, which leads to high latency. (2) A specific language is often imposed for the composition, and/or a specific structure, such as a simple unidirectional dataflow among the constituent tasks. (3) A program used as a component often takes a long time to start up due to the heavy load at initialization, which is referred to as the startup overhead. Our solution to these problems is a remote procedure call (RPC)-based composition, which is achieved by our middleware Rapid Service Connector (RaSC). RaSC can easily wrap an ordinary program and make it accessible as an RPC service, called a RaSC service. Using such component programs as RaSC services enables us to integrate them into one program with low latency without being restricted to a specific workflow language or dataflow structure. In addition, a RaSC service masks the startup overhead of a component program by keeping the processes of the component program alive across RPC requests. We also propose an architecture that automatically manages the number of processes to maximize the throughput. The experimental results showed that our approach excels in overall throughput as well as latency, despite its RPC overhead. We also showed that our approach can adapt to runtime changes in the throughput requirements.
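The idea of masking startup overhead by keeping a component process alive across requests can be illustrated with a minimal Python sketch. This is not RaSC's actual API or implementation, which the abstract does not describe. The names `PersistentWorker`, `make_server`, `CHILD_CODE`, and the RPC method name `process` are hypothetical, the component program is a simulated stand-in, and XML-RPC from the Python standard library is used here only as a convenient substitute for whatever RPC mechanism the middleware uses.

```python
import subprocess
import sys
import threading
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical stand-in for a component program with a costly initialization:
# it pays the startup cost once, then answers one output line per input line.
CHILD_CODE = r"""
import sys, time
time.sleep(0.2)  # simulated startup overhead, paid once per process
for line in sys.stdin:
    sys.stdout.write(line.strip().upper() + "\n")
    sys.stdout.flush()
"""


class PersistentWorker:
    """Keeps one child process alive and reuses it for every request."""

    def __init__(self, argv):
        self._proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered pipes in text mode
        )
        self._lock = threading.Lock()  # one request at a time per process

    def call(self, line):
        """Send one line to the running process and return its one-line reply."""
        with self._lock:
            self._proc.stdin.write(line + "\n")
            self._proc.stdin.flush()
            return self._proc.stdout.readline().rstrip("\n")

    def close(self):
        """Close the child's stdin so it exits, then wait for it."""
        self._proc.stdin.close()
        self._proc.wait()


def make_server(worker, host="127.0.0.1", port=0):
    """Expose worker.call as an RPC method; port 0 lets the OS pick a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(worker.call, "process")
    return server
```

Under these assumptions, `PersistentWorker([sys.executable, "-c", CHILD_CODE])` starts the component once, and each later `call(...)` reuses the same process, so only the first request waits for initialization. Calling `make_server(worker).serve_forever()` would then make the same method reachable from other programs over RPC.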
Year: 2016
DOI: 10.1109/CCGrid.2016.88
Venue: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Keywords: Large-scale data processing, program composition, service composition
Field: Middleware, Remote procedure call, Latency (engineering), Computer science, Real-time computing, Dataflow, Throughput, Distributed database, Workflow, Distributed computing
DocType: Conference
ISSN: 2376-4414
ISBN: 978-1-5090-2454-4
Citations: 2
PageRank: 0.45
References: 14
Authors: 3
Name              Order  Citations  PageRank
Masahiro Tanaka   1      56         7.00
Kenjiro Taura     2      551        55.30
Kentaro Torisawa  3      881        70.45