Title
Efficient Similarity Joins on Massive High-Dimensional Datasets Using MapReduce
Abstract
High-dimensional similarity join (HDSJ) is critical for many novel applications in mobile data management. Performing HDSJs efficiently today faces two challenges. First, datasets are growing rapidly, which makes parallel computing on a scalable platform a must. Second, the data can have hundreds or even thousands of dimensions, which brings about the curse of dimensionality. In this paper, we address these challenges and study how to perform parallel HDSJs efficiently in the MapReduce paradigm. In particular, we propose a cost model to demonstrate that it is important to take both communication and computation costs into account as dimensionality and data volume increase. To this end, we propose DAA (Dimension Aggregation Approximation), an efficient compression approach that can significantly reduce both of these costs when performing parallel HDSJs. Moreover, we design DAA-based parallel HDSJ algorithms that scale to massive data sizes and very high dimensionality. We perform extensive experiments on both synthetic and real datasets to evaluate the speedup and scaleup of our algorithms.
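The abstract does not spell out how DAA works, so the following is only an illustrative sketch of the general idea behind compression-based filtering in a similarity join, not the authors' algorithm. It assumes Euclidean distance and uses a generic segment-mean summary (in the spirit of piecewise aggregate approximation) as a stand-in: each vector is reduced to a few (length, mean) pairs, and a lower bound computed on those summaries prunes pairs before the full distance is evaluated. The function names, the `segments` parameter, and the single-process loop (no MapReduce) are all assumptions made for illustration.

```python
import math
from itertools import combinations


def aggregate(vec, segments):
    """Summarize vec as (block_length, block_mean) pairs over contiguous blocks.

    Hypothetical stand-in for a dimension-aggregating compression step.
    Requires 1 <= segments <= len(vec) so that no block is empty.
    """
    n = len(vec)
    summary = []
    for s in range(segments):
        lo = s * n // segments
        hi = (s + 1) * n // segments
        block = vec[lo:hi]
        summary.append((len(block), sum(block) / len(block)))
    return summary


def lower_bound(a, b):
    """Lower bound on the Euclidean distance, computed from two summaries.

    For each block, length * (mean_x - mean_y)^2 never exceeds the sum of
    squared coordinate differences in that block (Cauchy-Schwarz).
    """
    return math.sqrt(sum(length * (ma - mb) ** 2
                         for (length, ma), (_, mb) in zip(a, b)))


def euclidean(x, y):
    """Exact Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def similarity_join(vectors, eps, segments):
    """Return index pairs (i, j), i < j, whose distance is at most eps."""
    compressed = [aggregate(v, segments) for v in vectors]
    result = []
    for i, j in combinations(range(len(vectors)), 2):
        # Cheap filter on the compressed form: if even the lower bound
        # exceeds eps, the exact distance must exceed it too.
        if lower_bound(compressed[i], compressed[j]) > eps:
            continue
        # Refinement step on the original vectors.
        if euclidean(vectors[i], vectors[j]) <= eps:
            result.append((i, j))
    return result


if __name__ == "__main__":
    data = [
        [0.0, 0.0, 0.0, 0.0],
        [0.1, 0.0, 0.1, 0.0],
        [5.0, 5.0, 5.0, 5.0],
        [5.0, 5.1, 5.0, 5.0],
    ]
    print(similarity_join(data, eps=0.5, segments=2))
```

In a distributed setting, the appeal of such a summary is that the compact form is cheaper to ship between workers and cheaper to compare, which touches both cost terms the abstract mentions; how the paper actually partitions and shuffles the data is not described here.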
Year
2012
DOI
10.1109/MDM.2012.25
Venue
MDM
Keywords
massive high-dimensional, parallel processing, mapreduce, compression approach, efficient similarity joins, high-dimensional similarity join, mobile data management, data compression, massive data size, daa, dimensionality curse, parallel hdsjs, data volume increase, parallel hdsj algorithm, scalable platform, high dimensionality, real datasets, parallel computing, dimension aggregation approximation, high-dimensional datasets, mobile computing, hdsj, computational modeling, data models, algorithm design and analysis, approximation algorithms, vectors, time series analysis
Field
Approximation algorithm, Data mining, Data modeling, Joins, Algorithm design, Computer science, Curse of dimensionality, Data compression, Scalability, Speedup
DocType
Conference
ISBN
978-0-7695-4713-8
Citations
12
PageRank
0.69
References
18
Authors
4
Name           Order  Citations  PageRank
Wuman Luo      1      223        12.51
Haoyu Tan      2      326        18.31
Huajian Mao    3      101        6.41
Lionel M. Ni   4      9462       802.67