Title
When to conduct probabilistic linkage vs. deterministic linkage? A simulation study
Abstract
We evaluated linkage methods in simulations that represent real-life challenges. Probabilistic linkage was the more accurate method in poorer-quality data (more errors). Deterministic linkage was an equally valid and faster method in high-quality data. The difference between the two methods narrowed as the number of unmatchable records in the file increased. Both methods performed poorly if the linkage rules had only low discriminative power.

Introduction: When unique identifiers are unavailable, successful record linkage depends greatly on data quality and the types of variables available. While probabilistic linkage theoretically captures more true matches than deterministic linkage by allowing imperfection in identifiers, studies have shown inconclusive results, likely due to variations in data quality, implementation of linkage methodology, and validation method. This simulation study aimed to understand the data characteristics that affect the performance of probabilistic vs. deterministic linkage.

Methods: We created ninety-six scenarios that represent real-life situations using non-unique identifiers. We systematically introduced a range of discriminative power, rates of missing values and errors, and file sizes to increase linkage patterns and difficulties. We assessed the performance difference between linkage methods using standard validity measures and computation time.

Results: Across scenarios, deterministic linkage showed an advantage in PPV, while probabilistic linkage showed an advantage in sensitivity. Probabilistic linkage uniformly outperformed deterministic linkage, as the former generated linkages with a better trade-off between sensitivity and PPV regardless of data quality. However, with a low rate of missing values and errors in the data, deterministic linkage did not perform significantly worse. The implementation of deterministic linkage in SAS took less than 1 min, and probabilistic linkage took 2 min to 2 h depending on file size.

Discussion: Our simulation study demonstrated that the intrinsic rate of missing values and errors in the linkage variables was key to choosing between linkage methods. In general, probabilistic linkage was a better choice, but for exceptionally good quality data (
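The contrast the abstract draws between the two methods, and the validity measures used to compare them, can be sketched in a few lines. This is a minimal illustration only, not the authors' SAS implementation: the deterministic rule is assumed to be a single exact match on all fields, the probabilistic score is a Fellegi-Sunter style log-likelihood-ratio weight, and the field names, m/u probabilities, threshold, and toy records are all hypothetical values chosen for the example.

```python
import math

# Hypothetical per-field probabilities (assumed, not taken from the paper):
# m = P(field agrees | true match), u = P(field agrees | non-match).
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.97, 0.05),
    "sex": (0.99, 0.50),
    "zip": (0.90, 0.02),
}

def agreement(a, b):
    """Per field: True if equal, False if different, None if either is missing."""
    out = {}
    for field in M_U:
        if a.get(field) is None or b.get(field) is None:
            out[field] = None
        else:
            out[field] = a[field] == b[field]
    return out

def deterministic_link(a, b):
    """Assumed rule: link only when every field is present and agrees exactly."""
    return all(v is True for v in agreement(a, b).values())

def match_weight(a, b):
    """Sum of log2 likelihood ratios; a missing field contributes no evidence."""
    weight = 0.0
    for field, agrees in agreement(a, b).items():
        m, u = M_U[field]
        if agrees is True:
            weight += math.log2(m / u)
        elif agrees is False:
            weight += math.log2((1 - m) / (1 - u))
    return weight

def probabilistic_link(a, b, threshold=8.0):
    """Link when the total weight reaches a (hypothetical) threshold."""
    return match_weight(a, b) >= threshold

def sensitivity_ppv(pairs, truth, linker):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP)."""
    tp = fp = fn = 0
    for (a, b), is_match in zip(pairs, truth):
        linked = linker(a, b)
        if linked and is_match:
            tp += 1
        elif linked and not is_match:
            fp += 1
        elif not linked and is_match:
            fn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sens, ppv

# Toy candidate pairs (invented for illustration).
BASE = {"surname": "sato", "birth_year": 1950, "sex": "F", "zip": "10001"}
PAIRS = [
    (BASE, dict(BASE)),                                   # clean true match
    (BASE, dict(BASE, zip="10010")),                      # true match, zip error
    (BASE, dict(BASE, birth_year=None)),                  # true match, missing value
    (BASE, dict(BASE, surname="ito", zip="20002")),       # non-match
    (BASE, dict(BASE, zip="30003")),                      # non-match sharing name, year, sex
]
TRUTH = [True, True, True, False, False]

det_sens, det_ppv = sensitivity_ppv(PAIRS, TRUTH, deterministic_link)
prob_sens, prob_ppv = sensitivity_ppv(PAIRS, TRUTH, probabilistic_link)
print(f"deterministic: sensitivity={det_sens:.2f}, PPV={det_ppv:.2f}")
print(f"probabilistic: sensitivity={prob_sens:.2f}, PPV={prob_ppv:.2f}")
```

On this toy input the exact-match rule misses the true matches that contain an error or a missing value but makes no false links, while the weighted score recovers them at the cost of one false link. That is the same direction of trade-off the Results describe, though the numbers here say nothing about the magnitudes found in the study.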
Year: 2015
DOI: 10.1016/j.jbi.2015.05.012
Venue: Journal of Biomedical Informatics
Keywords: Comparative validity, Deterministic linkage, Probabilistic linkage, Record linkage, Simulation study
Field: Data mining, Record linkage, Linkage (mechanical), Data quality, Identifier, Computer science, File size, Probabilistic logic, Statistics, Discriminative model, Unique identifier
DocType: Journal
Volume: 56
Issue: C
ISSN: 1532-0464
Citations: 4
PageRank: 0.47
References: 4
Authors: 4
Name              Order  Citations  PageRank
Ying Zhu          1      4          0.47
Yutaka Matsuyama  2      4          0.47
Yasuo Ohashi      3      4          0.81
Soko Setoguchi    4      4          0.81