Title
Bucket-size balancing locality sensitive hashing using the map reduce paradigm
Abstract
Similarity search is an essential operation in such domains as data mining and content-based information retrieval. This simple operation causes considerable burden when the number of data records grows large, especially in big data applications. At the sacrifice of accuracy, approximate methods for finding similar ones have been developed to deliver effective services in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. Various algorithms have been proposed for locality sensitive hashing, which basically try to narrow down the candidate data set to be examined. The candidate data set does not always contain all the similar data to query and thus the search results are approximate. The increase in the size of a candidate set improves the recall of similar ones, but it deteriorates the processing speed. This paper is concerned with a method to increase the recall rate while not entailing significant cost. The method basically uses a random hyperplane partitioning technique to create buckets to which data objects are distributed. The nearest neighbors located on the other side of such hyperplanes can be false negatives when only the bucket to which query belongs is examined for finding similar neighbors. The proposed method extends the hyperplanes to occupy their vicinity so that the data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously. The over-sized buckets are further split by adding additional hyperplanes to control the bucket sizes. To improve the processing speed, the algorithm is realized in MapReduce paradigm on a Hadoop cluster. Some experiment results are presented to show its applicability.
Year
DOI
Venue
2019
10.1007/s10586-017-1013-2
Cluster Computing
Keywords
Field
DocType
Locality sensitive hashing, Similarity search, Hadoop, MapReduce
Locality-sensitive hashing,Data mining,Locality preserving hashing,Recall rate,Computer science,Hyperplane,Data objects,Big data,Nearest neighbor search,Data records
Journal
Volume
Issue
ISSN
22
SUPnan
1573-7543
Citations 
PageRank 
References 
1
0.35
31
Authors
4
Name
Order
Citations
PageRank
Kyung Mi Lee171.94
Yoon-Su Jeong24119.94
Sang-Ho Lee3284.50
Keon Myung Lee46018.73