Title
A Bayesian Nonparametric View on Count-Min Sketch.
Abstract
The count-min sketch is a time- and memory-efficient randomized data structure that provides a point estimate of the number of times an item has appeared in a data stream. The count-min sketch and related hash-based data structures are ubiquitous in systems that must track frequencies of data such as URLs, IP addresses, and language n-grams. We present a Bayesian view on the count-min sketch, using the same data structure, but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation. In particular, we take a nonparametric approach and consider tokens generated from a Dirichlet process (DP) random measure, which allows for an unbounded number of unique tokens. Using properties of the DP, we show that it is possible to straightforwardly compute posterior marginals of the unknown true counts and that the modes of these marginals recover the count-min sketch estimator, inheriting the associated probabilistic guarantees. Using simulated data and text data, we investigate the properties of these estimators. Lastly, we also study a modified problem in which the observation stream consists of collections of tokens (i.e., documents) arising from a random measure drawn from a stable beta process, which allows for power law scaling behavior in the number of unique tokens.
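For readers unfamiliar with the data structure the abstract builds on, the following is a minimal sketch of a classical count-min sketch and its point estimator (the minimum over the hashed counters). It is an illustration only, not the authors' code and not the Bayesian posterior described above. The class name CountMinSketch, the parameters depth and width, and the use of salted BLAKE2b as the hash family are all assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch (hypothetical names, not from the paper)."""

    def __init__(self, depth=4, width=256):
        self.depth = depth    # number of hash functions (rows)
        self.width = width    # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, token):
        # Assumption: one salted BLAKE2b hash per row stands in for the
        # pairwise-independent hash family used in the usual analysis.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, token, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._bucket(row, token)] += count

    def estimate(self, token):
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true count.
        return min(
            self.table[row][self._bucket(row, token)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for url in ["a.com"] * 5 + ["b.org"] * 2:
        sketch.add(url)
    print(sketch.estimate("a.com"), sketch.estimate("b.org"))
```

Because hash collisions only ever add to a counter, the estimate never falls below the true count. The abstract's contribution is to replace this single number with a posterior distribution over the true count, whose mode coincides with the same minimum.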
Year: 2018
Venue: ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018)
Keywords: deep neural networks, data stream, posterior distribution, dirichlet process, point estimate, random measure, count-min sketch, data structure, ip addresses, theoretical neuroscience
Field: Data structure, Mathematical optimization, Dirichlet process, Computer science, Algorithm, Posterior probability, Hash function, Probabilistic logic, Count–min sketch, Random measure, Sketch
DocType: Conference
Volume: 31
ISSN: 1049-5258
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name | Order | Citations | PageRank
Cai, Diana | 1 | 1 | 1.03
Michael Mitzenmacher | 2 | 7386 | 730.89
Ryan P. Adams | 3 | 2286 | 131.88