Abstract | ||
---|---|---|
There is a growing interest in on-line algorithms for analyzing and querying data streams, that examine each stream element only once and have at their disposal, only a limited amount of memory. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low logarithmic processing time per stream element, and (4) not assuming any a-priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate estimates for join sizes compared to earlier sketch-based techniques. |
Year | DOI | Venue |
---|---|---|
2004 | 10.1007/978-3-540-24741-8_33 | ADVANCES IN DATABASE TECHNOLOGY - EDBT 2004, PROCEEDINGS |
Keywords | Field | DocType |
lower bound,a priori knowledge | Data processing,Database query,Data element,Data stream,Computer science,Algorithm,Hash function,Logarithm,Hash table,Sketch | Conference |
Volume | ISSN | Citations |
2992 | 0302-9743 | 22 |
PageRank | References | Authors |
1.17 | 16 | 3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sumit Ganguly | 1 | 813 | 236.01 |
Minos Garofalakis | 2 | 4904 | 664.22 |
Rajeev Rastogi | 3 | 6151 | 827.22 |