Parallel Duplicate Detection in Adverse Drug Reaction Databases with Spark. - Citegraph

Paper Info

Title
Parallel Duplicate Detection in Adverse Drug Reaction Databases with Spark.

Abstract
The World Health Organization (WHO) and drug regulators in many countries maintain databases of adverse drug reaction reports. Data duplication is a significant problem in such databases, as reports often come from a variety of sources. Most duplicate detection techniques either cannot handle large amounts of data or lack effective means of dealing with imbalanced label distributions. In this paper, we propose a scalable duplicate detection method built on top of Spark to address these problems. Our method uses the kNN (k-nearest neighbors) classifier to identify the labelled report pairs that are most useful for classifying new report pairs. To deal with the high computational cost of kNN, we partition the labelled data into clusters for parallel computing. We give a method to minimize the cross-cluster kNN search. Our experimental results show that the proposed method produces robust duplicate detection results and scales well.
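The clustering idea in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' algorithm and it does not use Spark: it assumes each labelled report pair has already been turned into a numeric feature vector, that cluster centers are supplied by the caller, and that cross-cluster search is limited with a standard triangle-inequality bound. The names `build_clusters` and `knn_classify` are hypothetical. In a Spark setting, each cluster would plausibly map to a partition, but that mapping is an assumption here.

```python
import heapq
import math
from collections import Counter


def build_clusters(points, labels, centers):
    """Assign each labelled point to its nearest center and track cluster radii."""
    clusters = [{"center": c, "radius": 0.0, "items": []} for c in centers]
    for p, y in zip(points, labels):
        i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        clusters[i]["items"].append((p, y))
        clusters[i]["radius"] = max(clusters[i]["radius"], math.dist(p, centers[i]))
    return clusters


def knn_classify(query, clusters, k):
    """Return (majority label among the k nearest, number of clusters scanned)."""
    # Visit clusters from the nearest center to the farthest.
    order = sorted(clusters, key=lambda c: math.dist(query, c["center"]))
    best = []  # max-heap via negated distances: (-distance, label)
    scanned = 0
    for c in order:
        # Triangle inequality: no point in this cluster is closer than the bound.
        lower_bound = math.dist(query, c["center"]) - c["radius"]
        if len(best) == k and lower_bound >= -best[0][0]:
            continue  # this cluster cannot improve the current k nearest
        scanned += 1
        for p, y in c["items"]:
            d = math.dist(query, p)
            if len(best) < k:
                heapq.heappush(best, (-d, y))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, y))
    votes = Counter(y for _, y in best)
    return votes.most_common(1)[0][0], scanned


# Toy data (invented for illustration): two well-separated groups of pairs.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
labels = ["non-duplicate"] * 4 + ["duplicate"] * 3
clusters = build_clusters(points, labels, centers=[(0.5, 0.5), (10.5, 10.5)])
label, scanned = knn_classify((0.2, 0.3), clusters, k=3)
```

With well-separated clusters, the bound lets a query skip every cluster except its own, which is the effect the abstract describes as minimizing cross-cluster search. When clusters overlap, more clusters are scanned, but the result still matches an exhaustive kNN search.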

Year	Venue	Field
2016	EDBT	Data deduplication, k-nearest neighbors algorithm, Data mining, Adverse drug reaction, Duplicate detection, Spark (mathematics), Computer science, Classifier (linguistics), Database, Scalability
DocType	Citations	PageRank
Conference	2	0.41
References	Authors
11	2

Authors (2 rows)

Name	Order	Citations	PageRank
Chen Wang	1	361	93.70
Sarvnaz Karimi	2	380	33.01