Detecting Data Errors with Statistical Constraints. - Citegraph

Paper Info

Title
Detecting Data Errors with Statistical Constraints.

Abstract
A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user's domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable, and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge, beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs conflict with one another. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches.
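
To make the first task concrete, here is a minimal sketch of how an independence SC between two categorical columns might be checked. It is not the CODED implementation: it assumes Pearson's chi-squared test as the independence test, and the function names (`chi_squared_statistic`, `violates_independence_sc`) and the toy columns are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Chi-squared critical value for 1 degree of freedom at significance level 0.05.
CRITICAL_VALUE_1_DOF_005 = 3.841


def chi_squared_statistic(xs, ys):
    """Return (statistic, degrees_of_freedom) for Pearson's chi-squared
    test of independence between two categorical columns."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    joint_counts = Counter(zip(xs, ys))
    statistic = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n  # cell count expected under independence
            observed = joint_counts.get((x, y), 0)
            statistic += (observed - expected) ** 2 / expected
    dof = (len(x_counts) - 1) * (len(y_counts) - 1)
    return statistic, dof


def violates_independence_sc(xs, ys, critical_value):
    """Flag the independence SC as violated when the statistic exceeds the
    caller-supplied critical value (which depends on the degrees of freedom)."""
    statistic, _ = chi_squared_statistic(xs, ys)
    return statistic > critical_value


# Toy data: the first pair of columns is perfectly balanced, the second
# pair is perfectly correlated.
balanced_x = ["a", "a", "b", "b"] * 10
balanced_y = ["u", "v", "u", "v"] * 10
correlated_x = ["a", "b"] * 20
correlated_y = ["u", "v"] * 20

print(violates_independence_sc(balanced_x, balanced_y, CRITICAL_VALUE_1_DOF_005))
print(violates_independence_sc(correlated_x, correlated_y, CRITICAL_VALUE_1_DOF_005))
```

The critical value is passed in rather than computed so that the sketch stays within the standard library; a statistics package would normally supply the p-value directly.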

Year	Venue	DocType	Volume	Citations	PageRank	References	Authors
2019	arXiv: Databases	Journal	abs/1902.09711	0	0.34	38	4

Authors (4 rows)

Name	Order	Citations	PageRank
Jing Nathan Yan	1	2	2.05
Oliver Schulte	2	134	25.15
Jiannan Wang	3	1109	45.38
Reynold Cheng	4	3069	154.13
