Title
One-Pass, One-Hash n-Gram Statistics Estimation
Abstract
In multimedia, text or bioinformatics databases, applicat ions query se- quences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassum- ing algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is suffi cient for accu- rate estimates if the hashing is sufficiently independent. T o reduce costs fur- ther, we investigate recursive random hashing algorithms and show that they are sufficiently independent in practice. We compare our run ning times with exact counts using suffix arrays and show that, while we use ha rdly any stor- age, we are an order of magnitude faster. The approach further is extended to a one-pass/one-hash computation of n-gram entropy and iceberg counts. The experiments use a large collection of English text from the Gutenberg Project as well as synthetic data.
Year
Venue
Keywords
2006
Clinical Orthopaedics and Related Research
synthetic data,databases,computer science
Field
DocType
Volume
Data mining,Computer science,Theoretical computer science,Synthetic data,n-gram,Recursion,Computation,Suffix,Sampling (statistics),Hash function,Statistics,Statistical assumption,Database
Journal
abs/cs/061
Citations 
PageRank 
References 
0
0.34
36
Authors
2
Name
Order
Citations
PageRank
Daniel Lemire182152.14
Owen Kaser232524.02