Title
Efficient Nearest-Neighbor Search in the Probability Simplex
Abstract
Document similarity tasks arise in many areas of information retrieval and natural language processing. A fundamental question when comparing documents is which representation to use. Topic models, which have served as versatile tools for exploratory data analysis and visualization, represent documents as probability distributions over latent topics. Systems comparing topic distributions thus use measures of probability divergence such as Kullback-Leibler, Jensen-Shannon, or Hellinger. This paper presents novel analysis and applications of the reduction of Hellinger divergence to Euclidean distance computations. This reduction allows us to exploit fast approximate nearest-neighbor (NN) techniques, such as locality-sensitive hashing (LSH) and approximate search in k-d trees, for search in the probability simplex. We demonstrate the effectiveness and efficiency of this approach on two tasks using latent Dirichlet allocation (LDA) document representations: discovering relationships between National Institutes of Health (NIH) grants and prior-art retrieval for patents. Evaluation on these tasks and on synthetic data shows that both Euclidean LSH and approximate k-d tree search perform well when a single nearest neighbor must be found. When a larger set of similar documents is to be retrieved, the k-d tree approach is more effective and efficient.
Year
DOI
Venue
2013
10.1145/2499178.2499189
ICTIR
Keywords
Field
DocType
efficient nearest-neighbor search,approximate search,probability simplex,euclidean lsh,probability divergence,approximate k-d tree search,k-d tree approach,k-d tree,approximate nearest-neighbor,euclidean distance computation,probability distribution,latent dirichlet allocation,topic models
k-nearest neighbors algorithm,Data mining,Latent Dirichlet allocation,Document clustering,Euclidean distance,Probability distribution,Hash function,Topic model,Nearest neighbor search,Mathematics
Conference
Citations 
PageRank 
References 
4
0.41
30
Authors
4
Name
Order
Citations
PageRank
K. Krstovski1286.24
David A. Smith284354.39
Wallach Hanna3144577.79
Andrew McGregor427118.36