Title
An efficient similarity join algorithm with cosine similarity predicate
Abstract
Similarity join, the task of finding all pairs of similar objects in a large collection, is widely used to solve problems in many application domains. Its computation time is a critical issue, since a naive similarity join computes similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similarity computations; however, these algorithms filter out object pairs inefficiently, in particular when an aggregate weighted similarity function, such as cosine similarity, is used to quantify the similarity between objects. This is mostly caused by the large prefixes the algorithms select. In this paper, we propose an alternative method that selects small prefixes by exploiting the relationship between the arithmetic mean and the geometric mean of the elements' weights. A new algorithm, MMJoin, implements the proposed method and dramatically reduces the average prefix size without much overhead, which in turn saves considerable computation time. Through an empirical evaluation on large-scale real-world datasets, we demonstrate that our algorithm outperforms a state-of-the-art algorithm.
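To make the cost described in the abstract concrete, the brute-force baseline (computing cosine similarity for every pair of objects) can be sketched as follows. This is a minimal illustration only: it is neither MMJoin nor a prefix-filtering algorithm. The sparse `{element: weight}` dictionary representation and the names `cosine` and `naive_similarity_join` are assumptions made for this example and do not come from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {element: weight} dicts."""
    # Iterate over the smaller vector so the dot product touches fewer entries.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def naive_similarity_join(objects, threshold):
    """Return (i, j, similarity) for every pair whose cosine similarity
    is at least `threshold`. Examines all n * (n - 1) / 2 pairs."""
    pairs = []
    for i, j in combinations(range(len(objects)), 2):
        sim = cosine(objects[i], objects[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


# Hypothetical toy data: three objects with weighted elements.
objects = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 1.0},
    {"c": 1.0},
]
result = naive_similarity_join(objects, threshold=0.9)
```

Here only the pair of objects 0 and 1 passes the threshold, yet all three pairs are compared. Filtering techniques such as the prefix filtering mentioned in the abstract aim to skip most of these comparisons before the similarity is computed.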
Year: 2010
DOI: 10.1007/978-3-642-15251-1_33
Venue: DEXA (2)
Keywords: large prefix, existing algorithm, unnecessary similarity computation, cosine similarity predicate, computation time, cosine similarity, geometric mean, aggregate weighted similarity function, arithmetic mean, large collection, similarity value, efficient similarity
Field: Edit distance, Data mining, Cosine similarity, Computer science, Arithmetic mean, Prefix, Theoretical computer science, Computation, Inverted index, Filter (signal processing), Algorithm, Geometric mean, Database
DocType: Conference
Volume: 6262
ISSN: 0302-9743
ISBN: 3-642-15250-3
Citations: 19
PageRank: 0.79
References: 19
Authors: 4
Name | Order | Citations | PageRank
Dongjoo Lee | 1 | 182 | 12.87
Jaehui Park | 2 | 50 | 4.10
Junho Shim | 3 | 559 | 77.12
Sang-goo Lee | 4 | 832 | 151.04