Abstract |
---|
Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today, systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists in orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constraints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting, such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, Geode, builds upon Hive and uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. Geode supports all SQL operators, including Joins, across global data. |
Year | Venue | Field |
---|---|---|
2015 | NSDI | Query optimization, SQL, Joins, Workload, Computer science, Cache, Bandwidth (signal processing), Analytics, Big data, Database, Distributed computing |
DocType | Citations | PageRank |
---|---|---|
Conference | 41 | 1.16 |

References | Authors |
---|---|
27 | 6 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ashish Vulimiri | 1 | 187 | 8.44 |
Carlo Curino | 2 | 2012 | 90.35 |
P. Brighten Godfrey | 3 | 2519 | 145.37 |
Thomas Jungblut | 4 | 41 | 1.16 |
Jitendra Padhye | 5 | 6770 | 514.84 |
George Varghese | 6 | 8149 | 727.66 |