Title
Parallel Duplicate Detection in Adverse Drug Reaction Databases with Spark
Abstract
The World Health Organization (WHO) and drug regulators in many countries maintain databases of adverse drug reaction reports. Data duplication is a significant problem in such databases, as reports often come from a variety of sources. Most duplicate detection techniques either cannot handle large amounts of data or lack effective means of dealing with imbalanced label distributions. In this paper, we propose a scalable duplicate detection method built on top of Spark to address these problems. Our method uses a kNN (k-nearest neighbors) classifier to identify the labelled report pairs that are most useful for classifying new report pairs. To deal with the high computational cost of kNN, we partition the labelled data into clusters for parallel computing. We give a method to minimize cross-cluster kNN search. Our experimental results show that the proposed method produces robust duplicate detection results and achieves scalable performance.
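The clustering idea in the abstract can be illustrated with a small, single-machine sketch. This is not the authors' implementation and does not use Spark. It assumes each labelled report pair has already been turned into a numeric feature vector, and it swaps in a generic triangle-inequality pruning rule (skip a cluster when the distance to its centroid minus its radius cannot beat the current k-th nearest neighbour) as one plausible way to limit cross-cluster search. The names build_clusters and knn_classify, the centroids, and the sample data are all hypothetical.

```python
import heapq
import math
from collections import Counter


def build_clusters(points, labels, centroids):
    """Assign each labelled vector to its nearest centroid and track cluster radii."""
    clusters = [{"centroid": c, "members": [], "radius": 0.0} for c in centroids]
    for p, y in zip(points, labels):
        i = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[i]["members"].append((p, y))
        clusters[i]["radius"] = max(clusters[i]["radius"], math.dist(p, centroids[i]))
    return clusters


def knn_classify(query, clusters, k):
    """Return (majority label of the k nearest neighbours, number of clusters searched)."""
    # Visit clusters from the nearest centroid to the farthest.
    order = sorted(clusters, key=lambda c: math.dist(query, c["centroid"]))
    best = []  # max-heap of the k best so far, stored as (-distance, label)
    searched = 0
    for c in order:
        # Triangle inequality: no member of c is closer to the query than this bound.
        lower_bound = math.dist(query, c["centroid"]) - c["radius"]
        if len(best) == k and lower_bound >= -best[0][0]:
            continue  # this cluster cannot improve the current k nearest neighbours
        searched += 1
        for p, y in c["members"]:
            d = math.dist(query, p)
            if len(best) < k:
                heapq.heappush(best, (-d, y))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, y))
    votes = Counter(y for _, y in best)
    return votes.most_common(1)[0][0], searched


# Hypothetical feature vectors for labelled report pairs (illustrative data only).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = ["duplicate"] * 3 + ["distinct"] * 3
clusters = build_clusters(points, labels, centroids=[(0.0, 0.0), (5.0, 5.0)])

label, searched = knn_classify((0.05, 0.05), clusters, k=3)
print(label, searched)
```

In a distributed setting each cluster could live in its own partition, so a query would only touch the partitions that survive the pruning test. How the paper actually assigns clusters and bounds the cross-cluster search is not stated in the abstract.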
Year
2016
Venue
EDBT
Field
Data deduplication, k-nearest neighbors algorithm, Data mining, Adverse drug reaction, Duplicate detection, Spark (mathematics), Computer science, Classifier (linguistics), Database, Scalability
DocType
Conference
Citations
2
PageRank
0.41
References
11
Authors
2
Name            Order  Citations  PageRank
Chen Wang       1      361        93.70
Sarvnaz Karimi  2      380        33.01