One-Pass, One-Hash n-Gram Statistics Estimation - Citegraph

Paper Info

Title
One-Pass, One-Hash n-Gram Statistics Estimation

Abstract
In multimedia, text or bioinformatics databases, applicat ions query se- quences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassum- ing algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashing the data, which is prohibitive for large data sources. We prove that a one-pass one-hash algorithm is suffi cient for accu- rate estimates if the hashing is sufficiently independent. T o reduce costs fur- ther, we investigate recursive random hashing algorithms and show that they are sufficiently independent in practice. We compare our run ning times with exact counts using suffix arrays and show that, while we use ha rdly any stor- age, we are an order of magnitude faster. The approach further is extended to a one-pass/one-hash computation of n-gram entropy and iceberg counts. The experiments use a large collection of English text from the Gutenberg Project as well as synthetic data.

Year	Venue	Keywords
2006	Clinical Orthopaedics and Related Research	synthetic data,databases,computer science
Field	DocType	Volume
Data mining,Computer science,Theoretical computer science,Synthetic data,n-gram,Recursion,Computation,Suffix,Sampling (statistics),Hash function,Statistics,Statistical assumption,Database	Journal	abs/cs/061
Citations	PageRank	References
0	0.34	36
Authors
2

Authors (2 rows)

Cited by (0 rows)

References (36 rows)

Name	Order	Citations	PageRank
Daniel Lemire	1	821	52.14
Owen Kaser	2	325	24.02

1