Title
PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning
Abstract
In big data query processing, there is a trade-off between query accuracy and query efficiency. For example, sampling-based query approaches trade off query completeness for efficiency. In this article, we argue that query performance can be significantly improved by slightly lowering the possibility of query completeness, that is, the chance that a query is complete. To quantify this possibility, we define a new concept, the Probability of query Completeness (hereinafter referred to as PC). For example, if a query is executed 100 times, PC = 0.95 guarantees that there are no more than 5 incomplete results among the 100 results. Leveraging probabilistic data placement and scanning, we trade off PC for query performance. We propose PoBery (POssibly-complete Big data quERY), a method that supports neither complete queries nor incomplete queries, but possibly-complete queries. Experimental results on HiBench show that PoBery can significantly accelerate queries while ensuring the PC. Specifically, it is guaranteed that the percentage of complete queries is larger than the given PC confidence. Through comparison with state-of-the-art key-value stores, we show that while Drill-based PoBery performs as fast as Drill on complete queries, it is 1.7×, 1.1×, and 1.5× faster on average than Drill, Impala, and Hive, respectively, on possibly-complete queries.
Year: 2021
DOI: 10.1145/3465375
Venue: ACM/IMS Transactions on Data Science
DocType: Journal
Volume: 2
Issue: 3
ISSN: 2691-1922
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name          Order  Citations  PageRank
Jie Song      1      0          0.34
Qiang He      2      204        18.15
Feifei Chen   3      296        22.56
Ye Yuan       4      82         6.46
Ye Yuan       5      117        24.40