Abstract |
---|
This paper offers a novel look at using a dimensionality-reduction technique called simhash to detect similar document pairs in large-scale collections. We show that this algorithm produces interesting intermediate data, which is normally discarded, that can be used to predict which of the bits in the final hash are more susceptible to being flipped in similar documents. This paves the way for a probabilistic search technique in the Hamming space of simhashes that can be significantly faster and more space-efficient than the existing simhash approaches. We show that with 95% recall compared to deterministic search of prior work, our method exhibits 4-14 times faster lookup and requires 2-10 times less RAM on our collection of 70M web pages. |
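
The idea in the abstract can be illustrated with a minimal sketch. It assumes the "intermediate data" is the per-bit weight vector that standard simhash accumulates before taking signs, and that bits whose weight is close to zero are the ones most likely to flip between similar documents. Both assumptions are one plausible reading of the abstract, not the authors' confirmed method; the function names, the MD5-based feature hash, and the 64-bit width are hypothetical choices made for illustration.

```python
import hashlib

BITS = 64  # assumed fingerprint width, chosen for illustration


def feature_hash(token: str) -> int:
    """Map a token to a 64-bit integer (MD5 prefix; an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash_with_weights(tokens):
    """Return (fingerprint, weights).

    weights[i] is the signed vote total for bit i. Standard simhash keeps
    only the sign of each total; here the totals are returned as well.
    """
    weights = [0] * BITS
    for tok in tokens:
        h = feature_hash(tok)
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint, weights


def bits_by_flip_likelihood(weights):
    """Order bit positions from most to least likely to flip.

    Assumption: a small |weight| means the bit sits near the sign
    boundary, so a small change in the document can flip it.
    """
    return sorted(range(len(weights)), key=lambda i: abs(weights[i]))


def hamming(a: int, b: int) -> int:
    """Hamming distance between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog".split()
    doc_b = "the quick brown fox jumped over the lazy dog".split()
    fp_a, w_a = simhash_with_weights(doc_a)
    fp_b, _ = simhash_with_weights(doc_b)
    print("Hamming distance:", hamming(fp_a, fp_b))
    print("Most flip-prone bits of doc_a:", bits_by_flip_likelihood(w_a)[:8])
```

A probabilistic lookup in this spirit would try flipping the most flip-prone bits of a query fingerprint first, instead of enumerating all fingerprints within a fixed Hamming radius; the ranking model actually used in the paper may differ from this heuristic.
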
Year | DOI | Venue |
---|---|---|
2011 | 10.1145/2063576.2063737 | CIKM |
Keywords | Field | DocType |
probabilistic near-duplicate detection,dimensionality-reduction technique,large-scale collection,probabilistic search technique,similar document,similar document pair,interesting intermediate data,existing simhash approach,method exhibit,hamming space,final hash,clustering,web pages,hamming distance,similarity | Data mining,Duplicate detection,Information retrieval,Web page,Computer science,Hamming distance,Hash function,Hamming space,Probabilistic logic,Cluster analysis | Conference |
Citations | PageRank | References |
13 | 0.64 | 17 |
Authors |
---|
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Sadhan Sood | 1 | 13 | 0.64 |
Dmitri Loguinov | 2 | 1298 | 91.08 |