Title
Probabilistic near-duplicate detection using simhash
Abstract
This paper offers a novel look at using a dimensionality-reduction technique called simhash to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work, our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.
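As an illustration of the idea in the abstract, here is a minimal sketch of textbook simhash that keeps the per-bit running sums instead of discarding them. It assumes that the "intermediate data" refers to these sums, and that bits whose sum is close to zero are the ones most likely to flip in a near-duplicate. The names `feature_hash`, `simhash_with_sums`, and `weakest_bits`, the 64-bit width, the unit token weights, and the choice of BLAKE2b are all illustrative assumptions, not the authors' implementation.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the digest below supplies exactly 64 bits


def feature_hash(token: str) -> int:
    """Stable 64-bit hash of one token (BLAKE2b is an arbitrary choice here)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash_with_sums(tokens, bits=BITS):
    """Return (fingerprint, sums), where sums[i] is the signed vote total for bit i.

    Every token votes +1 on bit i if its hash has that bit set, else -1.
    Unit weights are an assumption; real systems often weight by term frequency.
    """
    sums = [0] * bits
    for tok in tokens:
        h = feature_hash(tok)
        for i in range(bits):
            sums[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, s in enumerate(sums):
        if s > 0:  # the final hash keeps only the sign of each sum
            fingerprint |= 1 << i
    return fingerprint, sums


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def weakest_bits(sums, k):
    """Indices of the k bits with the smallest |sum|, i.e. the least stable votes."""
    return sorted(range(len(sums)), key=lambda i: abs(sums[i]))[:k]


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox leaps over the lazy dog".split()
fp_a, sums_a = simhash_with_sums(doc_a)
fp_b, _ = simhash_with_sums(doc_b)
print("Hamming distance:", hamming(fp_a, fp_b))
print("Least stable bits of doc_a:", weakest_bits(sums_a, 5))
```

Under these assumptions, a probabilistic lookup could try flipping the least stable bits first rather than enumerating every fingerprint within a fixed Hamming radius.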
Year
2011
DOI
10.1145/2063576.2063737
Venue
CIKM
Keywords
probabilistic near-duplicate detection, dimensionality-reduction technique, large-scale collection, probabilistic search technique, similar document, similar document pair, interesting intermediate data, existing simhash approach, method exhibit, hamming space, final hash, clustering, web pages, hamming distance, similarity
Field
Data mining, Duplicate detection, Information retrieval, Web page, Computer science, Hamming distance, Hash function, Hamming space, Probabilistic logic, Cluster analysis
DocType
Conference
Citations
13
PageRank
0.64
References
17
Authors
2
Name             Order  Citations  PageRank
Sadhan Sood      1      13         0.64
Dmitri Loguinov  2      1298       91.08