Title
Discovering data quality rules
Abstract
Dirty data is a serious problem for businesses, leading to incorrect decision making, inefficient daily operations, and ultimately wasted time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code. In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule stating that, for CS graduate courses, the course number and term functionally determine the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records). We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and to return only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
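To make the notion of a CFD that "almost holds" concrete, the following is a minimal sketch of checking a single, already-given rule against a small table. It is not the discovery algorithm from the paper: the function name `check_cfd`, the column names, and the sample rows are all hypothetical, and treating the most common right-hand-side value in each group as the "clean" one is an assumption made only for illustration.

```python
from collections import Counter, defaultdict


def check_cfd(rows, condition, lhs, rhs):
    """Check the dependency lhs -> rhs on the rows matching `condition`.

    Returns (support, violations): the number of rows inside the context,
    and the rows whose rhs values differ from the most common rhs values
    of their lhs group (an illustrative heuristic, not the paper's method).
    """
    groups = defaultdict(list)
    for row in rows:
        # Keep only rows inside the context, e.g. dept = CS and level = grad.
        if all(row[attr] == value for attr, value in condition.items()):
            groups[tuple(row[attr] for attr in lhs)].append(row)

    support = sum(len(group) for group in groups.values())
    violations = []
    for group in groups.values():
        counts = Counter(tuple(r[attr] for attr in rhs) for r in group)
        majority, _ = counts.most_common(1)[0]
        violations.extend(
            r for r in group if tuple(r[attr] for attr in rhs) != majority
        )
    return support, violations


# Hypothetical course data; the third row disagrees on the room.
rows = [
    {"dept": "CS", "level": "grad", "number": "2500", "term": "F07", "room": "BA1170", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": "2500", "term": "F07", "room": "BA1170", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": "2500", "term": "F07", "room": "BA2135", "instructor": "Lee"},
    {"dept": "CS", "level": "grad", "number": "2510", "term": "W08", "room": "BA1200", "instructor": "Kim"},
    {"dept": "MATH", "level": "grad", "number": "2500", "term": "F07", "room": "SS100", "instructor": "Roy"},
    {"dept": "CS", "level": "undergrad", "number": "2500", "term": "F07", "room": "MP102", "instructor": "Ng"},
]

support, violations = check_cfd(
    rows,
    condition={"dept": "CS", "level": "grad"},
    lhs=["number", "term"],
    rhs=["room", "instructor"],
)
# Fraction of in-context rows that conform to the rule.
confidence = 1 - len(violations) / support
```

In this sketch the rule is evaluated only over the CS graduate rows, so the MATH and undergraduate rows neither support nor violate it. The returned `violations` list plays the role of the non-conformant records that the abstract says are reported alongside each rule.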
Year: 2008
DOI: 10.14778/1453856.1453980
Venue: PVLDB
Keywords: discovering data quality rule, non-conformant record, functional dependency, dirty data, dirty database, data quality rule, data value, data quality management process, data quality, data instance, data consistency, business rules, context dependent, data quality management
Field: Data mining, Data quality, Conditional functional dependencies, Computer science, Functional dependency, Dirty data, Business rule, Database, Scalability, Data consistency
DocType: Journal
Volume: 1
Issue: 1
ISSN: 2150-8097
Citations: 68
PageRank: 2.17
References: 23
Authors
2
Name
Order
Citations
PageRank
Fei Chiang125619.02
Renée J. Miller23545373.59