Improved robustness of signature-based near-replica detection via lexicon randomization - Citegraph

Paper Info

Title
Improved robustness of signature-based near-replica detection via lexicon randomization

Abstract
Detection of near-duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to small changes to document content. We focus on approaches to near-replica detection that are based upon large-collection statistics and present a general technique of increasing their robustness via multiple lexicon randomization. In experiments with large web-page and spam-email datasets the proposed method is shown to consistently outperform traditional I-Match, with the relative improvement in duplicate-document recall reaching as high as 40-60%. The large gains in detection accuracy are offset by only small increases in computational requirements.
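
The abstract names the technique but gives no implementation detail, so the following is only a minimal sketch of how a lexicon-filtered fingerprint with multiple randomized lexicons might look. It is not the authors' code. The assumptions are: whitespace tokenization, SHA-1 over the sorted set of lexicon terms found in a document, random subsets of one base lexicon as the "randomized" lexicons, and a match rule that accepts two documents if any one lexicon yields the same signature. All function names and the `extra`, `drop_fraction`, and `seed` parameters are hypothetical.

```python
import hashlib
import random


def signature(text, lexicon):
    """Fingerprint a document: hash the sorted set of its terms found in the lexicon.

    Returns None when no document term is in the lexicon, so that documents
    with nothing to fingerprint are never treated as matching each other.
    """
    terms = sorted(set(text.lower().split()) & lexicon)
    if not terms:
        return None
    return hashlib.sha1(" ".join(terms).encode("utf-8")).hexdigest()


def randomized_lexicons(lexicon, extra=4, drop_fraction=0.3, seed=0):
    """Return the base lexicon plus `extra` random subsets of it (assumed scheme)."""
    rng = random.Random(seed)
    ordered = sorted(lexicon)  # fixed order keeps sampling reproducible
    keep = max(1, int(len(ordered) * (1 - drop_fraction)))
    subsets = [set(rng.sample(ordered, keep)) for _ in range(extra)]
    return [set(lexicon)] + subsets


def near_duplicates(text_a, text_b, lexicons):
    """Declare a match if the documents share a signature under any lexicon."""
    for lex in lexicons:
        sig_a = signature(text_a, lex)
        if sig_a is not None and sig_a == signature(text_b, lex):
            return True
    return False


# Toy base lexicon; in practice it would be derived from collection statistics.
base = {"cheap", "pills", "online", "order", "today", "fast", "shipping", "discount"}
lexicons = randomized_lexicons(base)
```

With a single lexicon, changing one lexicon term in a document changes its only signature. With several lexicons, the changed term may be absent from at least one of them, in which case that lexicon's signature is unaffected and the pair can still be detected. The extra cost is one additional hash per lexicon per document.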

Year
2004

DOI
10.1145/1014052.1014127

Venue
KDD

Keywords
attractive computationally, small increase, lexicon randomization, data mining, near duplicate document, large web-page, traditional i-match, detection accuracy, signature-based near-replica detection, large gain, small change, traditional duplicate detection technique, improved robustness, data cleaning, web mining, deduplication, web pages

Field
Data deduplication, Replica, Data mining, Web mining, Computer science, Filter (signal processing), Fingerprint, Robustness (computer science), Lexicon, Artificial intelligence, Offset (computer science), Machine learning

DocType
Conference

ISBN
1-58113-888-1

Citations
43

PageRank
2.53

References
16

Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Aleksander Kołcz	1	628	66.65
Abdur Chowdhury	2	2013	160.59
Joshua Alspector	3	445	267.78
