Title
Octopus: Hybrid Big Data Integration Engine.
Abstract
Large enterprises today maintain huge amounts of data in multiple backend systems, including traditional database systems and recently popular big data systems. Telecom providers, for example, keep key business data (e.g., billing information) in database systems, whereas the huge volume of signaling log data resides on HDFS with Hive. Integrating such data and providing consolidated querying and analytics is a challenging task. Neither traditional data warehouses nor recent Big Data systems (e.g., Apache Spark and Hadoop) can fully leverage the power of each backend system. In this paper, we build a hybrid data processing engine, called Octopus, to fully integrate the backend systems. Because the data is distributed across backend systems at multiple locations, Octopus focuses on minimizing the amount of data movement. To this end, Octopus uses a query pushdown technique. A proof-of-concept prototype verifies that Octopus can achieve much faster running times than Spark. For example, Octopus processes an aggregation query 5.25x faster than Spark 1.4.0.
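The idea of pushing an aggregation down to a backend to reduce data movement can be illustrated with a minimal sketch. This is not Octopus's implementation: the in-memory SQLite database, the `billing` table, and all function names below are hypothetical stand-ins chosen for illustration. The sketch only compares how many rows leave the backend when the aggregation runs in the integration layer versus inside the backend.

```python
import sqlite3


def make_backend():
    """Create an in-memory SQLite database standing in for one backend system (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE billing (customer TEXT, amount REAL)")
    rows = [
        ("alice", 10.0),
        ("alice", 5.0),
        ("bob", 7.5),
        ("bob", 2.5),
        ("carol", 1.0),
    ]
    conn.executemany("INSERT INTO billing VALUES (?, ?)", rows)
    return conn


def aggregate_without_pushdown(conn):
    """Fetch every raw row, then aggregate in the integration layer."""
    moved = conn.execute("SELECT customer, amount FROM billing").fetchall()
    totals = {}
    for customer, amount in moved:
        totals[customer] = totals.get(customer, 0.0) + amount
    return totals, len(moved)


def aggregate_with_pushdown(conn):
    """Send the GROUP BY to the backend so only one row per group is moved."""
    moved = conn.execute(
        "SELECT customer, SUM(amount) FROM billing GROUP BY customer"
    ).fetchall()
    return dict(moved), len(moved)


conn = make_backend()
totals_pull, rows_pull = aggregate_without_pushdown(conn)
totals_push, rows_push = aggregate_with_pushdown(conn)
print("rows moved without pushdown:", rows_pull)
print("rows moved with pushdown:", rows_push)
```

Both strategies return the same totals; the difference is that the pushdown variant transfers one row per group instead of every raw row, which is the kind of data-movement saving the abstract refers to.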
Year
2015
DOI
10.1109/CloudCom.2015.111
Venue
CloudCom
Keywords
Octopus, hybrid Big Data integration engine, large enterprises, database systems, Big Data systems, telecom providers, business data, signaling log data, HDFS, Hive, hybrid data processing engine, backend systems, distributed data, optimization, data movement, query pushdown, aggregation query
Field
Data warehouse, Spark (mathematics), Computer science, Hybrid data, Big data, Operating system, Business data, Distributed computing
DocType
Conference
ISSN
2330-2194
Citations
4
PageRank
0.45
References
0
Authors
5
Name           Order  Citations  PageRank
Yanjie Chen    1      4          0.45
Chenyang Xu    2      585        23.07
Weixiong Rao   3      203        27.25
Hong Min       4      62         5.42
Gong Su        5      291        42.46