Title
Pivot-Based Similarity Wide-Joins Fostering Near-Duplicate Detection.
Abstract
Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Such highly similar reports, called near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates usually rely on two-phase processing: the first phase applies similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. To reduce that cost and improve near-duplicate detection, we developed a framework based on the similarity wide-join database operator. This paper extends the wide-join definition to overcome its restrictions and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35%.
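The general idea of pivot-based pruning in a similarity join can be illustrated with a minimal sketch. It is not the algorithm from the paper. It assumes that a wide-join returns the k closest pairs whose distance does not exceed a threshold xi, that the distance is Euclidean, and that the pivots are supplied by the caller. The names pivot_wide_join, xi and k are hypothetical. Distances to the pivots are precomputed once, and the triangle inequality then gives a lower bound that lets many pairs be discarded without computing their actual distance.

```python
import heapq
import math


def pivot_wide_join(left, right, pivots, xi, k, dist=math.dist):
    """Illustrative sketch: return (pairs, computed).

    pairs    -- up to k tuples (distance, i, j) with distance <= xi,
                sorted by increasing distance
    computed -- how many full distance computations were performed
    """
    # Precompute the distance from every element to every pivot.
    left_map = [[dist(x, p) for p in pivots] for x in left]
    right_map = [[dist(y, p) for p in pivots] for y in right]

    heap = []      # max-heap via negated distances; holds the best pairs so far
    computed = 0
    for i, x in enumerate(left):
        for j, y in enumerate(right):
            # Admissible radius: xi, or the k-th best distance once k pairs are held.
            radius = xi if len(heap) < k else min(xi, -heap[0][0])
            # Triangle inequality: |d(x,p) - d(y,p)| <= d(x,y) for every pivot p.
            lower = max(
                (abs(a - b) for a, b in zip(left_map[i], right_map[j])),
                default=0.0,
            )
            if lower > radius:
                continue  # pruned: the pair cannot qualify
            d = dist(x, y)
            computed += 1
            if d <= radius:
                if len(heap) < k:
                    heapq.heappush(heap, (-d, i, j))
                else:
                    heapq.heappushpop(heap, (-d, i, j))
    pairs = sorted((-nd, i, j) for nd, i, j in heap)
    return pairs, computed
```

Calling pivot_wide_join(reports_a, reports_b, pivots=[(0, 0), (1, 1)], xi=0.2, k=5) on two lists of 2-D points would return the five closest qualifying pairs in a single pass over the candidates. How much work is pruned depends heavily on which pivots are chosen, which is why pivot selection is treated as a problem in its own right.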
Year: 2016
DOI: 10.1007/978-3-319-62386-3_4
Venue: Lecture Notes in Business Information Processing
Keywords: Similarity search, Similarity join, Query operators, Wide-join, Near-duplicate detection
Field: Data mining, Joins, Duplicate detection, Monitoring system, Computer science, Crowdsourcing, Operator (computer programming), Cluster analysis, Nearest neighbor search
DocType: Conference
Volume: 291
ISSN: 1865-1348
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name                  Order  Citations  PageRank
Luiz Olmes Carvalho   1      5          3.56
Lucio F. D. Santos    2      25         6.76
Agma J. M. Traina     3      1024       153.61
Caetano Traina Jr.    4      1052       137.26