Title
How Improve Set Similarity Join Based On Prefix Approach In Distributed Environment
Abstract
Set similarity join is an essential operation for finding similar pairs of records in data integration and data analytics applications. To cope with the increasing scale of data, several techniques have been proposed that perform set similarity join on distributed frameworks (e.g., MapReduce). In particular, a MapReduce implementation of PPJoin, which has been experimentally shown to be one of the best set similarity join algorithms, is publicly available. However, these techniques produce huge numbers of duplicates in order to process the data in parallel. Moreover, they do not guarantee load balancing, which leads to skew and limits their scalability. To address these problems, we propose a duplicate-free technique called TTJoin, which performs set similarity join efficiently by using a novel filter derived from the prefix filter. We implemented TTJoin on Apache Spark, one of the most innovative distributed frameworks. Several experiments on real-world datasets demonstrate the effectiveness of the proposed solution compared with the traditional MapReduce implementation.
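The prefix filter mentioned in the abstract can be illustrated with a minimal single-machine sketch. This is the generic, textbook prefix filter for a Jaccard threshold. It is not TTJoin or its derived filter, and it is not distributed. The function names `jaccard` and `prefix_filter_join`, the record layout, and the rarest-first token ordering are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0


def prefix_filter_join(records, threshold):
    """Return (id1, id2, similarity) for pairs with Jaccard >= threshold.

    records: dict mapping a sortable record id to an iterable of tokens.
    threshold: Jaccard threshold in (0, 1].
    """
    # Global token order: rarest tokens first, so prefixes are selective.
    freq = Counter(tok for toks in records.values() for tok in set(toks))
    ordered = {
        rid: sorted(set(toks), key=lambda t: (freq[t], t))
        for rid, toks in records.items()
    }

    index = defaultdict(list)  # token -> ids whose prefix contains it
    results = []
    for rid in sorted(ordered):
        toks = ordered[rid]
        if not toks:
            continue
        # Two sets with Jaccard >= threshold must share a token within
        # their first |x| - ceil(threshold * |x|) + 1 ordered tokens.
        # The small epsilon guards against floating-point round-up.
        prefix_len = len(toks) - ceil(threshold * len(toks) - 1e-9) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verification step: compute the exact similarity of candidates.
        for other in candidates:
            sim = jaccard(toks, ordered[other])
            if sim >= threshold:
                results.append((other, rid, sim))
    return results


if __name__ == "__main__":
    sample = {
        1: ["a", "b", "c", "d"],
        2: ["a", "b", "c", "e"],
        3: ["x", "y", "z"],
        4: ["a", "b", "c", "d"],
    }
    for pair in prefix_filter_join(sample, 0.5):
        print(pair)
```

Because each record is probed against the index before it is added, every qualifying pair is reported once in this sketch. In a distributed setting, where records are replicated across partitions by prefix token, the same pair can be generated in several partitions, which is the duplicate problem the abstract refers to.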
Year
2018
DOI
10.1109/HPCS.2018.00136
Venue
PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS)
Keywords
Similarity Join, Big Data, Record Linkage
Field
Data integration, Data mining, Data modeling, Spark (mathematics), Distributed Computing Environment, Load balancing (computing), Computer science, Distributed database, Big data, Scalability
DocType
Conference
Citations
0
PageRank
0.34
References
0
Authors
4
Name  Order  Citations  PageRank
Song Zhu  1  123.18
Luca Gagliardelli  2  22.06
Giovanni Simonini  3  3111.55
Domenico Beneventano  4  46281.96