Abstract | ||
---|---|---|
Monitoring systems that aim to improve decision making in emergency scenarios currently benefit from crowdsourced information. The main issue with this kind of data is that the gathered reports quickly become too similar to one another. Reports that are too similar, namely near-duplicates, add no valuable knowledge to assist crisis control committees in their decision-making tasks. Current approaches to detecting near-duplicates are usually based on two-phase processing: the first phase relies on similarity queries or clustering techniques, and the second, most computationally costly phase refines the result of the first. Aiming to reduce that cost and to improve near-duplicate detection, we developed a framework based on the similarity wide-join database operator. This paper extends the wide-join definition to overcome its restrictions and provides an efficient pivot-based algorithm that speeds up the entire process while retrieving the most similar elements in a single pass. We also investigate alternatives and propose efficient algorithms for choosing the pivots. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35%. |
Year | DOI | Venue |
---|---|---|
2016 | 10.1007/978-3-319-62386-3_4 | Lecture Notes in Business Information Processing |
Keywords | Field | DocType |
Similarity search,Similarity join,Query operators,Wide-join,Near-duplicate detection | Data mining,Joins,Duplicate detection,Monitoring system,Computer science,Crowdsourcing,Operator (computer programming),Cluster analysis,Nearest neighbor search | Conference |
Volume | ISSN | Citations |
291 | 1865-1348 | 0 |
PageRank | References | Authors |
0.34 | 0 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Luiz Olmes Carvalho | 1 | 5 | 3.56 |
Lucio F. D. Santos | 2 | 25 | 6.76 |
Agma J. M. Traina | 3 | 1024 | 153.61 |
Caetano Traina Jr. | 4 | 1052 | 137.26 |