Probabilistic near-duplicate detection using simhash - Citegraph

Paper Info

Title
Probabilistic near-duplicate detection using simhash

Abstract
This paper offers a novel look at using a dimensionality-reduction technique called simhash to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work, our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages.
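The abstract's central idea can be illustrated with a minimal sketch of a generic simhash computation. This is not the authors' implementation: the function names (`simhash_with_weights`, `weakest_bits`, `hamming`), the use of MD5 as the per-token hash, the unweighted tokens, and the 64-bit width are all assumptions made for illustration. The per-bit sums are the "intermediate data" that is normally discarded once their signs are taken. Bits whose sums are close to zero are assumed here to be the ones most likely to flip between similar documents.

```python
import hashlib


def simhash_with_weights(tokens, bits=64):
    """Return (fingerprint, per-bit sums) for a list of string tokens.

    Assumption: each token has weight 1 and is hashed with MD5 truncated
    to 64 bits; real systems may weight tokens and use other hashes.
    """
    if not 1 <= bits <= 64:
        raise ValueError("this sketch supports 1 to 64 bits")
    sums = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # A set bit votes +1 for position i, a clear bit votes -1.
            sums[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, s in enumerate(sums):
        if s > 0:
            fingerprint |= 1 << i
    return fingerprint, sums


def weakest_bits(sums, k):
    """Indices of the k bits whose sums are closest to zero (hypothetical helper)."""
    return sorted(range(len(sums)), key=lambda i: abs(sums[i]))[:k]


def hamming(a, b):
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox leaps over the lazy dog".split()
fp_a, sums_a = simhash_with_weights(doc_a)
fp_b, sums_b = simhash_with_weights(doc_b)
print("Hamming distance:", hamming(fp_a, fp_b))
print("Least stable bits of doc_a:", weakest_bits(sums_a, 5))
```

A probabilistic search in the spirit of the abstract would try flipping the least stable bits first when probing for near-duplicates, rather than treating all bit positions as equally likely to differ.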

Year	DOI	Venue
2011	10.1145/2063576.2063737	CIKM
Keywords	Field	DocType
probabilistic near-duplicate detection,dimensionality-reduction technique,large-scale collection,probabilistic search technique,similar document,similar document pair,interesting intermediate data,existing simhash approach,method exhibit,hamming space,final hash,clustering,web pages,hamming distance,similarity	Data mining,Duplicate detection,Information retrieval,Web page,Computer science,Hamming distance,Hash function,Hamming space,Probabilistic logic,Cluster analysis	Conference
Citations	PageRank	References
13	0.64	17
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Sadhan Sood	1	13	0.64
Dmitri Loguinov	2	1298	91.08
