Abstract |
---|
The World Health Organization (WHO) and drug regulators in many countries maintain databases of adverse drug reaction reports. Data duplication is a significant problem in such databases because reports often come from a variety of sources. Most duplicate detection techniques either cannot handle large amounts of data or lack effective means of dealing with data that has an imbalanced label distribution. In this paper, we propose a scalable duplicate detection method built on top of Spark to address these problems. Our method uses the kNN (k nearest neighbors) classifier to identify the labelled report pairs that are most useful for classifying new report pairs. To deal with the high computational cost of kNN, we partition the labelled data into clusters for parallel computing. We give a method to minimize the cross-cluster kNN search. Our experimental results show that the proposed method produces robust duplicate detection results and scales well. |
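
The idea of partitioning labelled report pairs into clusters and limiting cross-cluster kNN search can be illustrated with a small single-process sketch. This is not the paper's algorithm or its Spark implementation: the pruning rule (a triangle-inequality lower bound from centroid distance and cluster radius), the fixed centroids, the two-dimensional similarity vectors, and the names `build_clusters` and `knn_classify` are all assumptions made for illustration. In a Spark setting, each cluster might plausibly map to one partition.

```python
import heapq
import math
from collections import Counter


def build_clusters(labelled, centroids):
    """Assign each (vector, label) pair to its nearest centroid.

    Also records each cluster's radius (largest member-to-centroid distance),
    which the search below uses to skip clusters. Centroids are given up front
    here; choosing them (e.g. by k-means) is outside this sketch.
    """
    clusters = [{"centroid": c, "points": [], "radius": 0.0} for c in centroids]
    for vec, label in labelled:
        i = min(range(len(centroids)), key=lambda j: math.dist(vec, centroids[j]))
        clusters[i]["points"].append((vec, label))
        clusters[i]["radius"] = max(
            clusters[i]["radius"], math.dist(vec, centroids[i])
        )
    return clusters


def knn_classify(query, clusters, k=3):
    """Return (majority label among the k nearest neighbours, clusters searched)."""
    # Search clusters in order of centroid distance, nearest first.
    order = sorted(clusters, key=lambda c: math.dist(query, c["centroid"]))
    best = []  # max-heap emulated with negated distances: (-distance, label)
    searched = 0
    for c in order:
        # No member of c can be closer to the query than this lower bound.
        lower_bound = math.dist(query, c["centroid"]) - c["radius"]
        if len(best) == k and lower_bound >= -best[0][0]:
            continue  # cluster cannot improve the current k nearest neighbours
        searched += 1
        for vec, label in c["points"]:
            d = math.dist(query, vec)
            if len(best) < k:
                heapq.heappush(best, (-d, label))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, label))
    votes = Counter(label for _, label in best)
    return votes.most_common(1)[0][0], searched


# Hypothetical similarity vectors for labelled report pairs (made-up data).
labelled = [
    ((0.90, 0.95), "duplicate"),
    ((0.85, 0.90), "duplicate"),
    ((0.95, 0.85), "duplicate"),
    ((0.10, 0.20), "distinct"),
    ((0.15, 0.10), "distinct"),
    ((0.20, 0.15), "distinct"),
    ((0.05, 0.10), "distinct"),
]
clusters = build_clusters(labelled, centroids=[(0.9, 0.9), (0.1, 0.15)])

# With k=3 the nearest cluster already holds 3 close neighbours,
# so the distant cluster is pruned and never scanned.
label_k3, searched_k3 = knn_classify((0.88, 0.92), clusters, k=3)

# With k=5 the nearest cluster is too small, forcing a cross-cluster search.
label_k5, searched_k5 = knn_classify((0.88, 0.92), clusters, k=5)
```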
Year | Venue | Field |
---|---|---|
2016 | EDBT | Data deduplication,k-nearest neighbors algorithm,Data mining,Adverse drug reaction,Duplicate detection,Spark (mathematics),Computer science,Classifier (linguistics),Database,Scalability |
DocType | Citations | PageRank |
---|---|---|
Conference | 2 | 0.41 |
References | Authors |
---|---|
11 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Chen Wang | 1 | 361 | 93.70 |
Sarvnaz Karimi | 2 | 380 | 33.01 |