Title
Efficient estimation for high similarities using odd sketches
Abstract
Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
Year
DOI
Venue
2014
10.1145/2566486.2568017
WWW
Keywords
Field
DocType
association rule learning,odd sketch,compact binary sketch,association rule,high similarity,weighted jaccard similarity,tf-idf vector comparison,efficient estimation,odd sketches,odd sketch-based estimator,jaccard similarity,bloom filters
Data mining,Symmetric difference,Computer science,Theoretical computer science,Artificial intelligence,Jaccard index,Sketch,Bloom filter,World Wide Web,Collaborative filtering,Association rule learning,Hash function,Machine learning,Estimator
Conference
Citations 
PageRank 
References 
19
0.66
18
Authors
3
Name
Order
Citations
PageRank
Michael Mitzenmacher17386730.89
Rasmus Pagh2134486.08
Ninh Pham31697.68