Title
Evaluating partitioning and bucketing strategies for Hive-based Big Data Warehousing systems.
Abstract
Hive has long been one of the industry-leading systems for Data Warehousing in Big Data contexts, mainly organizing data into databases, tables, partitions and buckets, stored on top of an unstructured distributed file system like HDFS. Some studies have been conducted to understand how to optimize the performance of several storage systems for Big Data Warehousing. However, few of them explore the impact of data organization strategies on query performance when Hive is used as the storage technology for implementing Big Data Warehousing systems. Therefore, this paper evaluates the impact of data partitioning and bucketing in Hive-based systems, testing different data organization strategies and verifying the effect of those strategies on query performance. The obtained results demonstrate the advantages of implementing Big Data Warehouses based on denormalized models and the potential benefit of using adequate partitioning strategies. Defining partitions aligned with the attributes that are frequently used in the conditions/filters of the queries can significantly increase the efficiency of the system in terms of response time. In the more intensive workload benchmarked in this paper, overall decreases of about 40% in processing time were verified. The same is not verified for bucketing strategies, which show potential benefits only in very specific scenarios, suggesting a more restricted use of this functionality, namely bucketing two tables by their join attribute.
Year: 2019
DOI: 10.1186/s40537-019-0196-1
Venue: Journal of Big Data
Keywords: Big Data, Big Data Warehouse, Hive, Partitions, Buckets
Field: Distributed File System, Data warehouse, Computational Science and Engineering, Data mining, Computer science, Workload, Response time, Big data, Data partitioning
DocType: Journal
Volume: 6
Issue: 1
ISSN: 2196-1115
Citations: 1
PageRank: 0.35
References: 0
Authors: 3
Name; Order; Citations; PageRank
Eduarda Costa; 1; 9; 1.86
Carlos Costa; 2; 38; 9.15
Maribel Yasmina Santos; 3; 146; 35.41