How Improve Set Similarity Join Based On Prefix Approach In Distributed Environment - Citegraph

Paper Info

Title
How Improve Set Similarity Join Based On Prefix Approach In Distributed Environment

Abstract
Set similarity join is an essential operation for finding similar pairs of records in data integration and data analytics applications. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity join on distributed frameworks (e.g., MapReduce). In particular, a MapReduce implementation of PPJoin, which was experimentally shown to be one of the best set similarity join algorithms, is publicly available. However, these techniques produce huge numbers of duplicates in order to parallelize the processing. Moreover, they do not guarantee load balancing, which leads to skew and limits their scalability. To address these problems, we propose a duplicate-free technique called TTJoin, which performs set similarity join efficiently by using an innovative filter derived from the prefix filter. We implemented TTJoin on Apache Spark, one of the most innovative distributed frameworks. Several experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to the traditional PPJoin MapReduce implementation.
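
For readers unfamiliar with the prefix filter mentioned in the abstract, the following is a minimal single-machine sketch of the classic prefix-filter idea for a Jaccard self-join. It is not TTJoin, not PPJoin, and not the authors' Spark code. The function names (`jaccard`, `prefix_filter_join`), the choice of Jaccard similarity, and the rare-tokens-first ordering are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_filter_join(records, threshold):
    """Self-join: return (i, j, similarity) for all i < j with Jaccard >= threshold.

    Illustrative sketch of the classic prefix filter, assuming 0 < threshold <= 1.
    """
    sets = [set(rec) for rec in records]

    # Global token order: rare tokens first, ties broken by the token itself.
    freq = Counter(tok for s in sets for tok in s)
    ordered = [sorted(s, key=lambda t: (freq[t], t)) for s in sets]

    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    candidates = set()
    for rid, toks in enumerate(ordered):
        if not toks:
            continue
        # Two sets with Jaccard >= threshold must share a token within their
        # first |x| - ceil(threshold * |x|) + 1 tokens under the global order.
        # The small epsilon guards against floating-point round-up in ceil.
        prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        for tok in toks[:prefix_len]:
            for other in index[tok]:
                candidates.add((other, rid))
            index[tok].append(rid)

    # Verification step: compute the exact similarity only for candidate pairs.
    results = []
    for i, j in sorted(candidates):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results


if __name__ == "__main__":
    data = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
    for i, j, sim in prefix_filter_join(data, 0.6):
        print(i, j, round(sim, 2))
```

Because each candidate pair is stored once in a set, this sketch produces no duplicate results on a single machine. The duplicate and load-balancing problems described in the abstract arise when records are replicated across partitions by prefix token, which this sketch does not model.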

Year: 2018
DOI: 10.1109/HPCS.2018.00136
Venue: PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS)
Keywords: Similarity Join, Big Data, Record Linkage
Field: Data integration, Data mining, Data modeling, Spark (mathematics), Distributed Computing Environment, Load balancing (computing), Computer science, Distributed database, Big data, Scalability
DocType: Conference
Citations: 0
PageRank: 0.34
References: 0
Authors: 4

Authors (4 rows)

Name	Order	Citations	PageRank
Song Zhu	1	12	3.18
Luca Gagliardelli	2	2	2.06
Giovanni Simonini	3	31	11.55
Domenico Beneventano	4	462	81.96
