Title
Processing Data-Stream Join Aggregates Using Skimmed Sketches
Abstract
There is a growing interest in on-line algorithms for analyzing and querying data streams, that examine each stream element only once and have at their disposal, only a limited amount of memory. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low logarithmic processing time per stream element, and (4) not assuming any a-priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate estimates for join sizes compared to earlier sketch-based techniques.
Year
DOI
Venue
2004
10.1007/978-3-540-24741-8_33
ADVANCES IN DATABASE TECHNOLOGY - EDBT 2004, PROCEEDINGS
Keywords
Field
DocType
lower bound,a priori knowledge
Data processing,Database query,Data element,Data stream,Computer science,Algorithm,Hash function,Logarithm,Hash table,Sketch
Conference
Volume
ISSN
Citations 
2992
0302-9743
22
PageRank 
References 
Authors
1.17
16
3
Name
Order
Citations
PageRank
Sumit Ganguly1813236.01
Minos Garofalakis24904664.22
Rajeev Rastogi36151827.22