Abstract | ||
---|---|---|
Set similarity join is an essential operation for finding similar pairs of records in data integration and data analytics applications. To cope with the increasing scale of data, several techniques have been proposed to perform set similarity join using distributed frameworks (e.g., MapReduce). In particular, a MapReduce implementation of PPJoin, which was experimentally shown to be one of the best set similarity join algorithms, is publicly available. However, these techniques produce huge amounts of duplicates in order to enable parallel processing. Moreover, these approaches do not guarantee load balancing, which leads to skew and limits their scalability. To address these problems, we propose a duplicate-free technique called TTJoin, which performs set similarity join efficiently by using an innovative filter derived from the prefix filter. Moreover, we implemented TTJoin on Apache Spark, one of the most innovative distributed frameworks. Several experiments on real-world datasets demonstrate the effectiveness of the proposed solution with respect to the traditional TTJoin MapReduce implementation. |
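
For readers unfamiliar with the prefix filter mentioned in the abstract, the following is a minimal single-machine sketch of the classic prefix filter for a Jaccard set similarity join. It is not TTJoin and not the authors' Spark or MapReduce code; the function names (`jaccard`, `prefix_filter_join`), the Jaccard measure, and the rarest-token-first ordering are assumptions chosen for illustration. The idea is that two sets with Jaccard similarity of at least `threshold` must share a token within their first `|r| - ceil(threshold * |r|) + 1` tokens under a common global order, so only those prefix tokens need to be indexed.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prefix_filter_join(records, threshold):
    """Illustrative sketch: return {(i, j): sim} for pairs with Jaccard >= threshold.

    `records` is a list of token sets; `threshold` is assumed to lie in (0, 1].
    """
    # Global token order (an assumption): rarest tokens first, ties alphabetical.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    # Index every record under the tokens of its prefix only.
    index = defaultdict(list)
    for rid, toks in enumerate(ordered):
        # The small epsilon guards against floating-point overshoot in ceil().
        prefix_len = len(toks) - ceil(threshold * len(toks) - 1e-9) + 1
        for tok in toks[:prefix_len]:
            index[tok].append(rid)

    # Records sharing a prefix token become candidates. The same pair can be
    # generated under several tokens, so a set is used to drop duplicates.
    candidates = set()
    for rids in index.values():
        candidates.update(combinations(rids, 2))

    # Verification step: compute the exact similarity of each candidate pair.
    result = {}
    for i, j in candidates:
        sim = jaccard(set(ordered[i]), set(ordered[j]))
        if sim >= threshold:
            result[(i, j)] = sim
    return result


if __name__ == "__main__":
    data = [{"a", "b", "c", "d"}, {"a", "b", "c", "e"}, {"x", "y", "z"}]
    print(prefix_filter_join(data, 0.5))
```

In this sketch a pair that shares several prefix tokens is produced once per shared token and then deduplicated with a set. In a distributed setting the same effect shows up as duplicate candidate pairs across partitions, which is the kind of overhead the abstract says TTJoin is designed to avoid.
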
Year | DOI | Venue |
---|---|---|
2018 | 10.1109/HPCS.2018.00136 | PROCEEDINGS 2018 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS) |
Keywords | Field | DocType |
Similarity Join, Big Data, Record Linkage | Data integration,Data mining,Data modeling,Spark (mathematics),Distributed Computing Environment,Load balancing (computing),Computer science,Distributed database,Big data,Scalability | Conference |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Song Zhu | 1 | 12 | 3.18 |
Luca Gagliardelli | 2 | 2 | 2.06 |
Giovanni Simonini | 3 | 31 | 11.55 |
Domenico Beneventano | 4 | 462 | 81.96 |