Title
How to juggle columns: an entropy-based approach for table compression
Abstract
Many relational databases exhibit complex dependencies between data attributes, caused either by the nature of the underlying data or by explicitly denormalized schemas. In data warehouse scenarios, calculated key figures may be materialized or hierarchy levels may be held within a single dimension table. Such column correlations and the resulting data redundancy may result in additional storage requirements. They may also result in bad query performance if inappropriate independence assumptions are made during query compilation. In this paper, we tackle the specific problem of detecting functional dependencies between columns to improve the compression rate for column-based database systems, which both reduces main memory consumption and improves query performance. Although a huge variety of algorithms have been proposed for detecting column dependencies in databases, we maintain that increased data volumes and recent developments in hardware architectures demand novel algorithms with much lower runtime overhead and smaller memory footprint. Our novel approach is based on entropy estimations and exploits a combination of sampling and multiple heuristics to render it applicable for a wide range of use cases. We demonstrate the quality of our approach by means of an implementation within the SAP NetWeaver Business Warehouse Accelerator. Our experiments indicate that our approach scales well with the number of columns and produces reliable dependence structure information. This both reduces memory consumption and improves performance for nontrivial queries.
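The abstract names entropy estimation and sampling as the basis for detecting functional dependencies but gives no algorithmic detail. The following is a minimal sketch of the general idea only, not the authors' implementation: a column X functionally determines a column Y when the conditional entropy H(Y | X) = H(X, Y) - H(X) is zero, and this can be checked cheaply on a row sample. All function names, the sample size, and the tolerance are hypothetical choices made for illustration.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy(xs, ys):
    """H(Y | X) = H(X, Y) - H(X), computed from paired column values."""
    return entropy(list(zip(xs, ys))) - entropy(xs)


def looks_functionally_dependent(xs, ys, sample_size=1000, tol=1e-9, seed=0):
    """Heuristic check for X -> Y on a random row sample (assumed parameters).

    A sample can only refute a dependency, never prove it, so True means
    "no counterexample was found in the sample".
    """
    rows = list(zip(xs, ys))
    if len(rows) > sample_size:
        rows = random.Random(seed).sample(rows, sample_size)
    sample_x = [row[0] for row in rows]
    sample_y = [row[1] for row in rows]
    return conditional_entropy(sample_x, sample_y) <= tol


# Toy hierarchy held in one dimension table: city determines country,
# but country does not determine city.
city = ["Berlin", "Paris", "Berlin", "Lyon"]
country = ["DE", "FR", "DE", "FR"]
print(looks_functionally_dependent(city, country))
print(looks_functionally_dependent(country, city))
```

In a column store, a detected dependency such as city -> country suggests that the dependent column carries no additional information once the determining column is known, which is the kind of redundancy a compression scheme can exploit.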
Year
2010
DOI
10.1145/1866480.1866510
Venue
IDEAS
Keywords
data redundancy, data attribute, nontrivial query, approach scale, bad query performance, memory consumption, increased data volume, main memory consumption, table compression, entropy-based approach, underlying data, data warehouse scenario, use case, optimistic concurrency control, hardware architecture, relational database, data warehouse, entropy estimation, functional dependency, mobile ad hoc networks, database system
Field
Data warehouse, Data mining, Data compression ratio, Relational database, Computer science, Functional dependency, Heuristics, Data redundancy, Memory footprint, Database, Optimistic concurrency control
DocType
Conference
Citations
7
PageRank
0.84
References
9
Authors
7
Name                Order   Citations   PageRank
Marcus Paradies     1       82          10.36
Christian Lemke     2       80          5.25
Hasso Plattner      3       770         63.43
Wolfgang Lehner     4       2243        294.69
Kai-Uwe Sattler     5       1144        126.81
Alexander Zeier     6       531         43.67
Jens Krueger        7       176         12.20