Title
Fast and Near-Optimal Algorithms for Approximating Distributions by Histograms
Abstract
Histograms are among the most popular structures for the succinct summarization of data in a variety of database applications. In this work, we provide fast and near-optimal algorithms for approximating arbitrary one-dimensional data distributions by histograms. A k-histogram is a piecewise constant function with k pieces. We consider the following natural problem, previously studied by Indyk, Levi, and Rubinfeld in PODS 2012: given samples from a distribution p over {1,...,n}, compute a k-histogram that minimizes the ℓ2-distance from p, up to an additive ε. We design an algorithm for this problem that uses the information-theoretically minimal sample size of m = O(1/ε^2), runs in sample-linear time O(m), and outputs an O(k)-histogram whose ℓ2-distance from p is at most O(opt_k) + ε, where opt_k is the minimum ℓ2-distance between p and any k-histogram. Perhaps surprisingly, the sample size and running time of our algorithm are independent of the universe size. We generalize our approach to obtain fast algorithms for multi-scale histogram construction, as well as approximation by piecewise polynomial distributions. We experimentally demonstrate one to two orders of magnitude improvement in terms of empirical running times over previous state-of-the-art algorithms.
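To make the problem statement concrete, the following is a minimal Python sketch of the objects the abstract defines: the empirical distribution built from samples, a k-histogram obtained by flattening that distribution over k intervals, and the ℓ2-distance between the two. The greedy merging routine is a simple quadratic-time heuristic chosen for illustration only. It is not the authors' algorithm and carries none of the paper's guarantees, and all function names here are hypothetical.

```python
from collections import Counter
import math


def empirical_distribution(samples, n):
    """Empirical probability vector over {1, ..., n} from a list of samples."""
    counts = Counter(samples)
    m = len(samples)
    return [counts.get(i, 0) / m for i in range(1, n + 1)]


def flatten(p, intervals):
    """Replace p by its mean on each half-open interval [start, end)."""
    q = []
    for start, end in intervals:
        mean = sum(p[start:end]) / (end - start)
        q.extend([mean] * (end - start))
    return q


def l2_distance(p, q):
    """Euclidean (l2) distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sse(p, start, end):
    """Squared l2 error of approximating p[start:end] by its mean."""
    seg = p[start:end]
    mean = sum(seg) / len(seg)
    return sum((x - mean) ** 2 for x in seg)


def greedy_merge_histogram(p, k):
    """Illustrative heuristic (an assumption, not the paper's method):
    start with one interval per element and repeatedly merge the adjacent
    pair whose merge increases the squared error the least, until k remain."""
    intervals = [(i, i + 1) for i in range(len(p))]
    while len(intervals) > k:
        best_j, best_cost = 0, float("inf")
        for j in range(len(intervals) - 1):
            a, b = intervals[j], intervals[j + 1]
            cost = sse(p, a[0], b[1]) - sse(p, a[0], a[1]) - sse(p, b[0], b[1])
            if cost < best_cost:
                best_j, best_cost = j, cost
        a, b = intervals[best_j], intervals[best_j + 1]
        intervals[best_j:best_j + 2] = [(a[0], b[1])]
    return intervals
```

For example, calling `greedy_merge_histogram(p, 2)` on a vector that is constant on two blocks recovers those two blocks, and `l2_distance(p, flatten(p, intervals))` then measures how far the resulting 2-histogram is from p. Unlike the algorithm described in the abstract, this sketch works on the full length-n vector, so its running time depends on the universe size.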
Year | DOI | Venue
2015 | 10.1145/2745754.2745772 | ACM SIGMOD Conference on Principles of DB Systems
Field | DocType | Citations
Histogram, Automatic summarization, Discrete mathematics, Combinatorics, Polynomial, Algorithm, Constant function, Database theory, Order of magnitude, Sample size determination, Piecewise, Mathematics | Conference | 12
PageRank | References | Authors
0.62 | 8 | 5
Name | Order | Citations | PageRank
Jayadev Acharya | 1 | 209 | 26.37
Ilias Diakonikolas | 2 | 776 | 64.21
Chinmay Hegde | 3 | 977 | 63.40
Jerry Li | 4 | 229 | 22.67
Ludwig Schmidt | 5 | 684 | 31.03