Title
QDflows: A System Driven by Knowledge Bases for Designing Quality-Aware Data flows
Abstract
In the big data era, data integration is becoming increasingly important. It is usually handled by data flow processes that extract, transform, and clean data from several sources and populate the data integration system (DIS). Designing data flows faces several challenges. In this article, we address three data quality tasks: (1) specifying a set of quality rules, (2) enforcing them on the data flow pipeline to detect violations, and (3) producing accurate repairs for the detected violations. We propose QDflows, a system for designing quality-aware data flows that takes as input (1) a high-quality knowledge base (KB) as the global schema of integration, (2) a set of data sources and a set of validated users’ requirements, (3) a set of defined mappings between data sources and the KB, and (4) a set of quality rules specified by users. QDflows uses an ontology to design the DIS schema. It offers the ability to define the DIS ontology as a module of the knowledge base, based on validated users’ requirements. The DIS ontology model is then extended with multiple types of quality rules specified by users. QDflows extracts and transforms data from sources to populate the DIS. It detects violations of the quality rules enforced on the data flows, constructs repair patterns, searches for horizontal and vertical matches in the knowledge base, and performs an automatic repair when possible or otherwise generates candidate repairs. It interactively involves users to validate the repair process before loading the clean data into the DIS. Using real-life and synthetic datasets together with the DBpedia and Yago knowledge bases, we experimentally evaluate the generality, effectiveness, and efficiency of QDflows. We also showcase an interactive tool implementing our system.
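The detect-then-repair step described in the abstract can be illustrated with a minimal sketch. This is not the QDflows implementation: the toy knowledge base, the single "attribute must agree with the KB" rule, and every name below (TOY_KB, Violation, detect_violations, repair) are hypothetical and chosen only for illustration. It shows one plausible reading of "automatic repair when possible, otherwise candidate repairs": a violation is fixed automatically only when the KB offers exactly one candidate value, and is otherwise left for user validation.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a knowledge base: entity -> {property: set of values}.
# A real system would query DBpedia or Yago instead of a dictionary.
TOY_KB = {
    "Paris": {"country": {"France"}},
    "Algiers": {"country": {"Algeria"}},
    "Georgia": {"country": {"Georgia", "United States"}},
}


@dataclass
class Violation:
    row: int          # index of the offending record
    attribute: str    # attribute whose value disagrees with the KB
    value: str        # value found in the source data
    candidates: list  # sorted KB values that would satisfy the rule


def detect_violations(rows, key, attribute, kb):
    """Flag rows whose `attribute` value is not among the KB values for row[key]."""
    violations = []
    for i, row in enumerate(rows):
        expected = kb.get(row[key], {}).get(attribute)
        if expected is None:
            continue  # the KB has no evidence, so the rule cannot be checked
        if row[attribute] not in expected:
            violations.append(Violation(i, attribute, row[attribute], sorted(expected)))
    return violations


def repair(rows, violations):
    """Apply unambiguous repairs; return (cleaned rows, violations needing a user)."""
    cleaned = [dict(r) for r in rows]  # copy so the source rows stay untouched
    pending = []
    for v in violations:
        if len(v.candidates) == 1:
            cleaned[v.row][v.attribute] = v.candidates[0]  # automatic repair
        else:
            pending.append(v)  # several candidates: defer to the user
    return cleaned, pending


if __name__ == "__main__":
    source = [
        {"city": "Paris", "country": "Germany"},
        {"city": "Algiers", "country": "Algeria"},
        {"city": "Georgia", "country": "Russia"},
    ]
    found = detect_violations(source, "city", "country", TOY_KB)
    cleaned, pending = repair(source, found)
    print(cleaned)
    print(pending)
```

In this sketch the Paris record is repaired automatically, while the Georgia record is returned with two candidate values so that a user can choose between them before the data is loaded.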
Year: 2017
DOI: 10.1145/3064173
Venue: J. Data and Information Quality
Keywords: Data flows, data quality, graph-based repairing, knowledge bases
Field: Data integration, Data mining, Ontology, Ontology-based data integration, Data quality, Information retrieval, Computer science, Knowledge base, Big data, Schema (psychology), Data flow diagram
DocType: Journal
Volume: 8
Issue: 3-4
ISSN: 1936-1955
Citations: 0
PageRank: 0.34
References: 61
Authors: 3
Name                Order  Citations  PageRank
Sabrina Abdellaoui  1      0          0.68
Fahima Nader        2      4          3.75
Rachid Chalal       3      0          1.01