Title
Selecting suitable configurations for automated link discovery
Abstract
Linking individuals in one dataset to the same individuals in existing datasets is a major problem known as link discovery. Existing automated link discovery techniques make users responsible for selecting suitable properties, distances and transformations, a.k.a. configurations, which is challenging for both researchers and practitioners. Furthermore, failing to provide suitable configurations dramatically increases the complexity of link discovery since many configurations need to be evaluated. Current approaches to help users select proper configurations assume datasets are not heterogeneous or require the existence of a schema or ontology, making them less appealing in the context of Linked Data. In this paper, we present an approach to help users select suitable configurations solely based on data, i.e., no schema or ontology is required. We rely on the concepts of universality and uniqueness, i.e., we favor properties that are present in many individuals of the datasets to link (universality) and do not have repeated objects (uniqueness). We use the concept of singularity to focus on configurations in which only a few individuals are very similar while the rest are very dissimilar. We evaluate our approach using eight commonly-used scenarios, in which, on average, we only suggest 5% of all the possible configurations. Additionally, selected configurations consistently generate links achieving high precision and recall with respect to a ground truth. Finally, we provide a number of guidelines to apply our approach in additional scenarios.
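The universality and uniqueness ideas mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formal definitions, so the formulas here are assumptions made only for illustration: universality is taken as the fraction of individuals that have at least one value for a property, and uniqueness as the ratio of distinct objects to all objects of that property. The function name `property_scores` and the toy triples are hypothetical.

```python
from collections import defaultdict


def property_scores(triples):
    """Return {property: (universality, uniqueness)} for (subject, property, object) triples.

    Assumed definitions (not taken from the paper):
      universality = individuals having the property / all individuals
      uniqueness   = distinct objects of the property / all objects of the property
    """
    subjects = set()
    subjects_with = defaultdict(set)   # property -> individuals that use it
    objects = defaultdict(list)        # property -> every object value seen
    for s, p, o in triples:
        subjects.add(s)
        subjects_with[p].add(s)
        objects[p].append(o)

    scores = {}
    for p, values in objects.items():
        universality = len(subjects_with[p]) / len(subjects)
        uniqueness = len(set(values)) / len(values)
        scores[p] = (universality, uniqueness)
    return scores


# Hypothetical toy dataset: three individuals, two properties.
triples = [
    ("a", "name", "Alice"),
    ("b", "name", "Bob"),
    ("c", "name", "Carol"),
    ("a", "country", "ES"),
    ("b", "country", "ES"),
]
scores = property_scores(triples)
```

Under these assumed definitions, `name` scores high on both measures because every individual has it and no value repeats, while `country` scores lower on both, so `name` would be the more promising property to compare when linking.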
Year
2020
DOI
10.1145/3341105.3373882
Venue
SAC '20: The 35th ACM/SIGAPP Symposium on Applied Computing, Brno, Czech Republic, March 2020
Keywords
Linked data, link discovery, data integration
DocType
Conference
ISBN
978-1-4503-6866-7
Citations
1
PageRank
0.35
References
0
Authors
2
Name              Order  Citations  PageRank
Carlos R. Rivero  1      111        16.25
David Ruiz        2      152        20.62