Title
Test-driven evaluation of linked data quality
Abstract
Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowdsourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. The methodology assesses the quality of linked data resources based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas and automatic test case instantiations for all available schemata registered with Linked Open Vocabularies (LOV). One of the main advantages of our approach is that domain-specific semantics can be encoded in the data quality test cases, which makes it possible to discover data quality problems beyond conventional quality heuristics.
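The abstract's central idea, a SPARQL query template instantiated into a concrete test case query, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `%%NAME%%` placeholder syntax, the example pattern, the `instantiate` helper, and the choice of DBpedia properties as bindings are all assumptions made for this sketch.

```python
import re

# Hypothetical test case pattern: selects resources whose two literal values
# violate an expected ordering. Each result row is one constraint violation.
PATTERN = """SELECT DISTINCT ?s WHERE {
  ?s %%P1%% ?v1 .
  ?s %%P2%% ?v2 .
  FILTER ( ?v1 %%OP%% ?v2 )
}"""

# Placeholder syntax assumed for this sketch: %%NAME%%
PLACEHOLDER = re.compile(r"%%([A-Z0-9]+)%%")


def instantiate(pattern, bindings):
    """Replace every %%NAME%% placeholder in the pattern with its binding."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for placeholder {name}")
        return bindings[name]
    return PLACEHOLDER.sub(substitute, pattern)


# Example binding (assumed): a birth date must not be later than a death date.
query = instantiate(PATTERN, {
    "P1": "<http://dbpedia.org/ontology/birthDate>",
    "P2": "<http://dbpedia.org/ontology/deathDate>",
    "OP": ">",
})
print(query)
```

Running the resulting query against a SPARQL endpoint would return the resources that fail the test case; an empty result means the test passes. Keeping patterns separate from bindings is what would let the same pattern be reused across schemas, with bindings supplied manually or derived from schema constraints.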
Year
2014
DOI
10.1145/2566486.2568002
Venue
WWW
Keywords
low quality, test-driven evaluation, automatic test case instantiations, data quality test case, test-driven quality assessment, data quality problem, automatic test case instantiation, concrete quality test case, conventional quality heuristics, manual test case instantiation, varying quality, linked data, data quality
Field
Ontology (information science), Data mining, World Wide Web, Data quality, Computer science, Linked data, SPARQL, Test case, Data model, Schema (psychology), Software development
DocType
Conference
Citations
93
PageRank
3.90
References
24
Authors
7
Name                   Order   Citations   PageRank
Dimitris Kontokostas   1       490         31.79
Patrick Westphal       2       132         7.98
Sören Auer             3       5711        418.56
Sebastian Hellmann     4       2007        130.09
Jens Lehmann           5       5375        355.08
Roland Cornelissen     6       103         5.04
Amrapali Zaveri        7       368         24.37