Abstract | ||
---|---|---|
A powerful approach to detecting erroneous data is to check which potentially dirty data records are incompatible with a user's domain knowledge. Previous approaches allow the user to specify domain knowledge in the form of logical constraints (e.g., functional dependencies and denial constraints). We extend the constraint-based approach by introducing a novel class of statistical constraints (SCs). An SC treats each column as a random variable, and enforces an independence or dependence relationship between two (or a few) random variables. Statistical constraints are expressive, allowing the user to specify a wide range of domain knowledge, beyond traditional integrity constraints. Furthermore, they work harmoniously with downstream statistical modeling. We develop CODED, an SC-Oriented Data Error Detection system that supports three key tasks: (1) checking whether an SC is violated or not on a given dataset, (2) identifying the top-k records that contribute the most to the violation of an SC, and (3) checking whether a set of input SCs have conflicts or not. We present effective solutions for each task. Experiments on synthetic and real-world data illustrate how SCs apply to error detection, and provide evidence that CODED performs better than state-of-the-art approaches. |
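
The first task in the abstract, checking whether an independence SC is violated on a dataset, can be illustrated with a standard Pearson chi-square test on two categorical columns. This is a minimal sketch using a textbook test as a stand-in; the abstract does not say which test CODED uses. The function names, the column names `a` and `b`, and the sample records are hypothetical. The threshold 3.841 is the usual 0.05 critical value for one degree of freedom.

```python
from collections import Counter


def chi_square_statistic(records, col_a, col_b):
    """Return (Pearson chi-square statistic, degrees of freedom) for two categorical columns."""
    n = len(records)
    joint = Counter((r[col_a], r[col_b]) for r in records)
    count_a = Counter(r[col_a] for r in records)
    count_b = Counter(r[col_b] for r in records)
    stat = 0.0
    for a, na in count_a.items():
        for b, nb in count_b.items():
            expected = na * nb / n           # count expected under independence
            observed = joint.get((a, b), 0)  # count actually seen in the data
            stat += (observed - expected) ** 2 / expected
    dof = (len(count_a) - 1) * (len(count_b) - 1)
    return stat, dof


def independence_sc_violated(records, col_a, col_b, critical_value):
    """Hypothetical helper: flag the SC 'col_a is independent of col_b' as violated
    when the statistic exceeds a caller-supplied critical value."""
    stat, _ = chi_square_statistic(records, col_a, col_b)
    return stat > critical_value


# Hypothetical toy data: in `dependent`, column b is fully determined by column a.
dependent = [{"a": "x", "b": 1}, {"a": "x", "b": 1},
             {"a": "y", "b": 2}, {"a": "y", "b": 2}]
independent = [{"a": "x", "b": 1}, {"a": "x", "b": 2},
               {"a": "y", "b": 1}, {"a": "y", "b": 2}]

CRITICAL_05_DOF1 = 3.841  # chi-square critical value at alpha = 0.05, 1 degree of freedom
print(independence_sc_violated(dependent, "a", "b", CRITICAL_05_DOF1))
print(independence_sc_violated(independent, "a", "b", CRITICAL_05_DOF1))
```

A sample this small is far below what a chi-square approximation needs in practice; the data is only there to make the arithmetic easy to follow.
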
Year | Venue | DocType |
---|---|---|
2019 | arXiv: Databases | Journal |
Volume | Citations | PageRank |
abs/1902.09711 | 0 | 0.34 |
References | Authors | |
38 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Jing Nathan Yan | 1 | 2 | 2.05 |
Oliver Schulte | 2 | 134 | 25.15 |
Jiannan Wang | 3 | 1109 | 45.38 |
Reynold Cheng | 4 | 3069 | 154.13 |