Title
ERACER: a database approach for statistical inference and data cleaning
Abstract
Real-world databases often contain syntactic and semantic errors, in spite of integrity constraints and other safety measures incorporated into modern DBMSs. We present ERACER, an iterative statistical framework for inferring missing information and correcting such errors automatically. Our approach is based on belief propagation and relational dependency networks, and includes an efficient approximate inference algorithm that is easily implemented in standard DBMSs using SQL and user-defined functions. The system performs the inference and cleansing tasks in an integrated manner, using shrinkage techniques to infer correct values accurately even in the presence of dirty data. We evaluate the proposed methods empirically on multiple synthetic and real-world data sets. The results show that our framework achieves accuracy comparable to a baseline statistical method using Bayesian networks with exact inference. However, our framework has wider applicability than the Bayesian network baseline, due to its ability to reason with complex, cyclic relational dependencies.
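The abstract mentions shrinkage without giving a formula. As a rough illustration of the general idea only, and not the paper's own formulation, the sketch below blends a local estimate (computed from few, possibly dirty, observations) with a global estimate, trusting the local value more as its sample size grows. The function name `shrink` and the pseudo-count parameter `k` are hypothetical choices made for this example.

```python
def shrink(local_mean, local_n, global_mean, k=5.0):
    """Generic shrinkage estimate (illustrative; not taken from the paper).

    local_mean  -- estimate computed from a small group of observations
    local_n     -- number of observations behind local_mean
    global_mean -- estimate computed from the whole data set
    k           -- hypothetical pseudo-count controlling how strongly
                   small groups are pulled toward the global estimate
    """
    weight = local_n / (local_n + k)  # 0 with no local data, approaches 1 as local_n grows
    return weight * local_mean + (1.0 - weight) * global_mean


if __name__ == "__main__":
    # With no local observations the estimate falls back to the global mean.
    print(shrink(10.0, 0, 4.0))
    # With local_n equal to k the two estimates are weighted equally.
    print(shrink(10.0, 5, 4.0))
```

With a large `local_n` the result stays close to the local estimate, so a handful of erroneous values in a small group has limited influence on what is inferred for that group.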
Year
2010
DOI
10.1145/1807167.1807178
Venue
SIGMOD Conference
Keywords
cyclic relational dependency,efficient approximate inference algorithm,exact inference,dirty data,bayesian network,bayesian network baseline,modern dbmss,baseline statistical method,statistical inference,real-world data set,iterative statistical framework,database approach,belief propagation,system performance,data cleaning,linear regression,outlier detection,integrity constraints
Field
Data mining,Frequentist inference,Computer science,Approximate inference,Statistical inference,Artificial intelligence,Bayesian statistics,Adaptive neuro fuzzy inference system,Fiducial inference,Inference,Bayesian network,Machine learning,Database
DocType
Conference
Citations
61
PageRank
1.88
References
15
Authors
3
Name,Order,Citations,PageRank
Chris Mayfield,1,335,18.86
Jennifer Neville,2,2092,117.45
Sunil Prabhakar,3,2664,152.75