Title
Efficient Data Blocking and Skipping Framework Applying Heuristic Rules
Abstract
Data blocking is an effective data-skipping technique that reduces data access and shortens query response time in query engines. When data is partitioned into fine-grained, balanced blocks with corresponding metadata, a query may skip a block if the metadata indicates that the block contains no relevant data. The value of a blocking strategy therefore depends on whether it can produce, in reasonable time, a data layout that is expected to skip most of the data. In this paper, we propose several algorithms that drastically reduce the time complexity of existing blocking strategies based on workload analysis, at the cost of a relatively small loss in the estimated number of tuples that can be skipped. Through theoretical analysis, we prove that the time complexity of our algorithms is lower than that of the Ward algorithm. We then present the complete blocking and skipping workflow, integrate it into Spark SQL, and evaluate it experimentally. The results show that our technique significantly improves blocking efficiency compared with the Ward algorithm while retaining almost the same skipping ability.
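The skipping idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the paper derives blocks and metadata from workload analysis, whereas this sketch assumes the simplest possible metadata, a per-block minimum and maximum over a single integer column. The names Block, build_blocks, and range_query are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Block:
    rows: List[int]   # values stored in this block
    min_val: int      # metadata: smallest value in the block (assumed form)
    max_val: int      # metadata: largest value in the block (assumed form)


def build_blocks(values: Iterable[int], block_size: int) -> List[Block]:
    """Sort the values and cut them into equally sized (balanced) blocks."""
    ordered = sorted(values)
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append(Block(chunk, chunk[0], chunk[-1]))
    return blocks


def range_query(blocks: List[Block], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually scanned."""
    hits: List[int] = []
    scanned = 0
    for block in blocks:
        # Skip the block when its metadata proves it cannot match [lo, hi].
        if block.max_val < lo or block.min_val > hi:
            continue
        scanned += 1
        hits.extend(v for v in block.rows if lo <= v <= hi)
    return hits, scanned


blocks = build_blocks(range(100), block_size=10)
hits, scanned = range_query(blocks, 25, 34)
print(len(blocks), scanned, hits)
```

With 100 values in blocks of 10, the query for the range 25 to 34 reads only the two blocks whose metadata overlaps the range and skips the other eight.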
Year: 2017
DOI: 10.1109/ICPADS.2017.00037
Venue: 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS)
Keywords: data blocking, data skipping, workload, metadata, query response time, Spark SQL
Field: Data warehouse, Metadata, Heuristic, Spark (mathematics), Tuple, Computer science, Algorithm, Time complexity, Cluster analysis, Data access, Distributed computing
DocType: Conference
ISSN: 1521-9097
ISBN: 978-1-5386-3208-6
Citations: 0
PageRank: 0.34
References: 0
Authors: 5
Name            Order  Citations  PageRank
Yong Wang       1      275        92.19
Xiao-Chun Yun   2      215        41.96
Xi Wang         3      4          1.08
Shupeng Wang    4      59         19.97
Yongshang Wu    5      0          1.01