Title
Online Person Name Disambiguation with Constraints
Abstract
While many clustering techniques have been successfully applied to the person name disambiguation problem, most do not address two main practical issues: allowing constraints to be added to the clustering process, and allowing data to be added incrementally without re-clustering the entire database. Constraints are particularly useful in a system such as a digital library, where users are allowed to correct the disambiguated result. For example, a user correction specifying that a record does not belong to an author could be kept as a cannot-link constraint and used in any future disambiguation (such as when new documents are added). Besides such user corrections, constraints also allow background heuristics to be encoded into the disambiguation process. We propose a constraint-based clustering algorithm for person name disambiguation, based on the density-based clustering algorithm DBSCAN combined with a pairwise distance based on random forests. We further propose an extension to DBSCAN to handle online clustering, so that disambiguation can be done iteratively as new data points are added. Our algorithm utilizes similarity features based on both metadata information and citation similarity. We implement two types of clustering constraints to demonstrate the concept. Experiments on the CiteSeer data show that our model can achieve 0.95 pairwise F1 and 0.79 cluster F1. The presence of constraints also consistently improves the disambiguation result across different combinations of features.
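As a rough illustration of how a cannot-link constraint could interact with density-based clustering, here is a minimal Python sketch. It is not the authors' implementation: the precomputed matrix `dist` is a hypothetical stand-in for the random-forest-based pairwise distance, the function name `constrained_dbscan` and its parameters are assumptions made for this example, and the rule used to enforce constraints (a point is not added to a cluster that already holds one of its cannot-link partners) is only one plausible choice.

```python
NOISE = -1


def constrained_dbscan(dist, eps, min_pts, cannot_link=()):
    """DBSCAN-style clustering over a precomputed distance matrix.

    dist        -- square list of lists of pairwise distances (assumed given;
                   stands in for a learned pairwise distance)
    eps         -- neighborhood radius
    min_pts     -- minimum neighborhood size (including the point itself)
                   for a point to count as a core point
    cannot_link -- iterable of (i, j) index pairs that must not share a cluster
    Returns a list of cluster labels; NOISE marks unclustered points.
    """
    n = len(dist)
    blocked = {i: set() for i in range(n)}
    for i, j in cannot_link:
        blocked[i].add(j)
        blocked[j].add(i)

    def neighbors(i):
        # All points within eps of i, including i itself.
        return [j for j in range(n) if dist[i][j] <= eps]

    labels = [None] * n
    cluster_id = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        seeds = neighbors(p)
        if len(seeds) < min_pts:
            labels[p] = NOISE
            continue
        # p is a core point: start a new cluster and expand it.
        labels[p] = cluster_id
        members = {p}
        queue = [q for q in seeds if q != p]
        while queue:
            q = queue.pop(0)
            if labels[q] not in (None, NOISE):
                continue  # already assigned to a cluster
            if blocked[q] & members:
                continue  # cannot-link violation: leave q for another cluster
            labels[q] = cluster_id
            members.add(q)
            q_neighbors = neighbors(q)
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is a core point: keep expanding
        cluster_id += 1
    return labels


# Toy data: five records on a line, distance = absolute difference.
points = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]

unconstrained = constrained_dbscan(dist, eps=1.5, min_pts=2)
constrained = constrained_dbscan(dist, eps=1.5, min_pts=2, cannot_link=[(0, 2)])
print(unconstrained)  # [0, 0, 0, 1, 1]
print(constrained)    # [0, 0, 1, 2, 2]
```

In the toy run, the cannot-link pair (0, 2) plays the role of a user correction: record 2 is kept out of the cluster containing record 0 and forms its own cluster instead. Because the outcome of this greedy rule depends on visiting order, a production system would likely need a more careful treatment.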
Year
2015
DOI
10.1145/2756406.2756915
Venue
ACM/IEEE Joint Conference on Digital Libraries
Keywords
Name Entity Recognition, Record Linking, Name Disambiguation, Clustering, Online Disambiguation
Field
Data point, Pairwise comparison, Data mining, Metadata, Fuzzy clustering, Information retrieval, Computer science, Heuristics, Constrained clustering, Cluster analysis, DBSCAN
DocType
Conference
ISBN
978-1-4503-3594-2
Citations
8
PageRank
0.62
References
26
Authors
3
Name                   Order  Citations  PageRank
Madian Khabsa          1      237        18.81
Pucktada Treeratpituk  2      177        11.12
C. Lee Giles           3      11154      1549.48