Title
Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees
Abstract
Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The paper considers the efficient evaluation of such queries, as well as of the closely related top-k cosine similarity queries. It presents techniques with novel optimality guarantees that perform well on real datasets. We take as a starting point Fagin’s well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm partially traverses the index to gather a set of candidate vectors that are later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized, both to obtain an improved, tight stopping condition for index traversal and to compute it efficiently and incrementally. Then we show that multiple real-world datasets from mass spectrometry, natural language processing, and computer vision exhibit a certain form of data skewness, and we exploit this property to obtain better traversal strategies. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general, since they can be applied to a large class of similarity functions beyond cosine.
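The baseline that the abstract starts from (build an inverted index, traverse it partially, then verify the candidates) can be sketched as follows. This is only an illustrative sketch of a plain TA-style traversal, not the paper's improved method: it assumes non-negative, unit-normalized vectors, treats "above θ" as "at least θ", advances all lists in lockstep, and uses the classic TA stopping bound rather than the tight condition the paper derives. The function names `normalize`, `build_index`, and `cosine_threshold_query` are hypothetical.

```python
import math
from collections import defaultdict


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def build_index(vectors):
    """Inverted index: one list per dimension of (value, vector_id),
    sorted by value in descending order. Zero entries are skipped."""
    index = defaultdict(list)
    for vid, v in enumerate(vectors):
        for dim, x in enumerate(v):
            if x > 0:
                index[dim].append((x, vid))
    for postings in index.values():
        postings.sort(reverse=True)
    return index


def cosine_threshold_query(vectors, index, q, theta):
    """Return the sorted ids of vectors whose dot product with q is >= theta.

    Assumes q and all database vectors are non-negative and unit-normalized,
    so the dot product equals the cosine similarity."""
    dims = [d for d, x in enumerate(q) if x > 0]
    lists = {d: index.get(d, []) for d in dims}
    pos = {d: 0 for d in dims}
    candidates = set()
    while True:
        # Upper bound on the score of any vector not yet seen: in every
        # list it can only appear at or below the current position.
        bound = sum(
            q[d] * (lists[d][pos[d]][0] if pos[d] < len(lists[d]) else 0.0)
            for d in dims
        )
        exhausted = all(pos[d] >= len(lists[d]) for d in dims)
        if bound < theta or exhausted:
            break
        # Lockstep traversal: advance every non-exhausted list by one entry.
        for d in dims:
            if pos[d] < len(lists[d]):
                candidates.add(lists[d][pos[d]][1])
                pos[d] += 1
    # Verification phase: compute the exact similarity of each candidate.
    return sorted(
        vid for vid in candidates
        if sum(a * b for a, b in zip(q, vectors[vid])) >= theta
    )


vectors = [normalize(v) for v in [[1, 0, 0], [0.6, 0.8, 0], [0, 0, 1], [0.7, 0.7, 0.1]]]
index = build_index(vectors)
print(cosine_threshold_query(vectors, index, [1.0, 0.0, 0.0], 0.65))  # [0, 3]
```

In this toy run the traversal stops before reaching vector 1, whose entry in the first dimension (0.6) already falls below the threshold, so that vector is never gathered as a candidate.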
Year
2019
DOI
10.1007/s00224-020-10009-6
Venue
Theory of Computing Systems
Keywords
Vector databases, Similarity search, Cosine, Threshold algorithm
DocType
Conference
Volume
65
Issue
1
ISSN
1432-4350
Citations
0
PageRank
0.34
References
44
Authors
5
Name                     Order  Citations  PageRank
Yuliang Li               1      54         16.90
Jianguo Wang             2      69         6.18
Benjamin Pullman         3      0          0.34
Nuno Bandeira            4      53         7.33
Yannis Papakonstantinou  5      5657       837.56