Title
Detecting Data Errors with Statistical Constraints.
Abstract
A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user's domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs conflict with one another. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches.
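To make the first task concrete, the sketch below shows one conventional way to check an independence SC between two categorical columns: a Pearson chi-squared test on their contingency table. This is an illustrative assumption, not necessarily the test CODED uses; the function names, the toy columns, and the critical value 3.841 (the standard 0.05-level cutoff for one degree of freedom) are all chosen here for illustration.

```python
from collections import Counter


def chi2_statistic(xs, ys):
    """Return (Pearson chi-squared statistic, degrees of freedom)
    for two equally long categorical columns."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("columns must be non-empty and of equal length")
    joint = Counter(zip(xs, ys))  # observed joint counts
    row = Counter(xs)             # marginal counts of the first column
    col = Counter(ys)             # marginal counts of the second column
    stat = 0.0
    for x, rx in row.items():
        for y, cy in col.items():
            expected = rx * cy / n          # count expected under independence
            observed = joint.get((x, y), 0)
            stat += (observed - expected) ** 2 / expected
    dof = (len(row) - 1) * (len(col) - 1)
    return stat, dof


def independence_sc_violated(xs, ys, critical_value):
    """Hypothetical helper: an independence SC counts as violated when the
    statistic exceeds a caller-supplied critical value."""
    stat, _ = chi2_statistic(xs, ys)
    return stat > critical_value


# Toy columns (made up for illustration only).
color = ["red"] * 4 + ["blue"] * 4
price_dependent = ["high"] * 4 + ["low"] * 4        # fully determined by color
price_independent = ["high", "high", "low", "low"] * 2  # same mix for each color

print(independence_sc_violated(color, price_dependent, 3.841))
print(independence_sc_violated(color, price_independent, 3.841))
```

The caller supplies the critical value so the sketch stays within the standard library; in practice the threshold would be derived from the chosen significance level and the returned degrees of freedom.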
Year
2019
Venue
arXiv: Databases
DocType
Journal
Volume
abs/1902.09711
Citations
0
PageRank
0.34
References
38
Authors
4
Name              Order  Citations  PageRank
Jing Nathan Yan   1      2          2.05
Oliver Schulte    2      134        25.15
Jiannan Wang      3      1109       45.38
Reynold Cheng     4      3069       154.13