Title
Similarity join size estimation using locality sensitive hashing
Abstract
Similarity joins are important operations with a broad range of applications. In this paper, we study the problem of vector similarity join size estimation (VSJ). It is a generalization of the previously studied set similarity join size estimation (SSJ) problem and can handle more interesting cases such as TF-IDF vectors. One of the key challenges in similarity join size estimation is that the join size can change dramatically depending on the input similarity threshold. We propose a sampling based algorithm that uses Locality-Sensitive-Hashing (LSH). The proposed algorithm LSH-SS uses an LSH index to enable effective sampling even at high thresholds. We compare the proposed technique with random sampling and the state-of-the-art technique for SSJ (adapted to VSJ) and demonstrate LSH-SS offers more accurate estimates throughout the similarity threshold range and small variance using real-world data sets.
Year
DOI
Venue
2011
10.14778/1978665.1978666
Clinical Orthopaedics and Related Research
Keywords
DocType
Volume
similarity threshold range,input similarity threshold,size estimation,lsh index,high threshold,effective sampling,vector similarity,broad range,proposed algorithm,random sampling,locality sensitive hashing
Journal
4
Issue
ISSN
Citations 
6
Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 6, pp. 338-349 (2011)
27
PageRank 
References 
Authors
0.87
15
3
Name
Order
Citations
PageRank
Hongrae Lee123913.78
Raymond T. Ngt267271192.90
Kyuseok Shim35120752.19