Efficient Set Similarity Joins Using Min-prefixes - Citegraph

Paper Info

Title
Efficient Set Similarity Joins Using Min-prefixes

Abstract
Identification of all objects in a dataset whose similarity is not less than a specified threshold is of major importance for management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile to represent a variety of similarity notions. Most set similarity join methods proposed so far present two main phases at a high level of abstraction: candidate generation producing a set of candidate pairs and verification applying the actual similarity measure to the candidates and returning the correct answer. Previous work has primarily focused on the reduction of candidates, where candidate generation presented the major effort to obtain better pruning results. Here, we propose an opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups as compared to previous algorithms.

Year	DOI	Venue
2009	10.1007/978-3-642-03973-7_8	ADBIS
Keywords	Field	DocType
actual similarity measure,verification phase,candidate generation,efficient set similarity,better pruning result,candidate pair,previous work,previous algorithm,major importance,major effort,similarity notion,indexation	Inverted index,Data mining,Joins,Abstraction,Data analysis,Similarity measure,Workload,Computer science,Prefix,Database	Conference
Volume	ISSN	Citations
5739	0302-9743	13
PageRank	References	Authors
0.58	12	2

Authors (2 rows)

Cited by (13 rows)

References (12 rows)

Name	Order	Citations	PageRank
Leonardo A. Ribeiro	1	18	1.02
Theo Härder	2	1132	307.12

1