Title
Efficient aggregation algorithms for probabilistic data
Abstract
We study the problem of computing aggregation operators on probabilistic data in an I/O efficient manner. Algorithms for aggregation operators such as SUM, COUNT, AVG, and MIN/MAX are crucial to applications on probabilistic databases. We give a generalization of the classical data stream model to handle probabilistic data, called probabilistic streams, in order to analyze the I/O-requirements of our algorithms. Whereas the algorithms for SUM and COUNT turn out to be simple, the problem is harder for both AVG and MIN/MAX. Although data stream algorithms typically use randomness, all of the algorithms we present are deterministic. For MIN and MAX, we obtain efficient one-pass data stream algorithms for estimating each of these quantities with relative accuracy (1 + ε), using constant update time per element and O(1/ε lg R) space, where each element has a value between 1 and R. For AVG, we present a new data stream algorithm for estimating its value to a relative accuracy (1 + ε) in O(log n) passes over the data with O(1/ε log2 n) space and update time O(1/ε log n) per element. On the other hand, we prove a space lower bound of Ω(n) for any exact one-pass deterministic data stream algorithm. Complementing this result, we also present an O(n log2 n)-time exact deterministic algorithm which uses O(n) space (thus removing the data-streaming restriction), improving dramatically on the previous O(n3)-time algorithm. Our algorithms for AVG involve a novel technique based on generating functions and numerical integration, which may be of independent interest. Finally, we provide an experimental analysis and show that our algorithms, coupled with additional heuristics, have excellent performance over large data sets.
Year
DOI
Venue
2007
10.5555/1283383.1283420
SODA
Keywords
Field
DocType
large data set,relative accuracy,aggregation operator,log2 n,efficient aggregation algorithm,probabilistic data,deterministic data stream algorithm,classical data stream model,efficient one-pass data stream,log n,data stream,probabilistic database
Binary logarithm,Discrete mathematics,Data set,Combinatorics,Upper and lower bounds,Computer science,Data stream,Algorithm,Probabilistic analysis of algorithms,Probabilistic logic,Deterministic algorithm,Randomness
Conference
ISBN
Citations 
PageRank 
978-0-89871-624-5
55
1.68
References 
Authors
21
3
Name
Order
Citations
PageRank
T. S. Jayram1137375.87
Satyen Kale2143690.68
Erik Vee374743.61