ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models. - Citegraph

Paper Info

Title
ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models.

Abstract
Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training processes due to violation of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-training and can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.
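
The abstract describes updating the model incrementally as records are cleaned, rather than re-training, and prioritizing records likely to affect the result. The abstract does not give the update rule or the prioritization scheme, so the following is only a generic illustration of the idea for one convex loss model (least-squares linear regression). The names `gradient`, `incremental_update`, and `cleaning_priority`, the step size, and the use of the gradient norm as a priority score are all assumptions made for this sketch, not the paper's method.

```python
from typing import List, Sequence, Tuple

Record = Tuple[Sequence[float], float]  # (features, label)


def gradient(theta: Sequence[float], record: Record) -> List[float]:
    """Gradient of the squared loss 0.5 * (theta . x - y)^2 for one record."""
    x, y = record
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [residual * xi for xi in x]


def incremental_update(
    theta: Sequence[float],
    cleaned_batch: Sequence[Record],
    step_size: float = 0.1,  # assumed value, chosen for illustration
) -> List[float]:
    """Take one gradient step from the current model using newly cleaned records.

    The existing parameters are adjusted in place of a full re-train.
    """
    n = len(cleaned_batch)
    avg = [0.0] * len(theta)
    for record in cleaned_batch:
        for j, g in enumerate(gradient(theta, record)):
            avg[j] += g / n
    return [t - step_size * g for t, g in zip(theta, avg)]


def cleaning_priority(theta: Sequence[float], record: Record) -> float:
    """Hypothetical priority score: the gradient norm under the current model."""
    return sum(g * g for g in gradient(theta, record)) ** 0.5


# Usage: repeatedly fold in a small cleaned batch generated by y = 2 * x.
theta = [0.0]
cleaned = [([1.0], 2.0), ([2.0], 4.0)]
for _ in range(50):
    theta = incremental_update(theta, cleaned)
```

In this toy run `theta[0]` moves toward 2, the slope that fits the cleaned records. A record with a larger gradient norm would be ranked earlier for cleaning under the assumed score.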

Year	Venue	Field
2016	arXiv: Databases	Data mining, MNIST database, Computer science, Dirty data, Artificial intelligence, Workflow, Linear regression, Active learning, Iterative and incremental development, Support vector machine, Outlier, Database, Machine learning
DocType	Volume	Citations
Journal	abs/1601.03797	6
PageRank	References	Authors
0.47	20	5

Authors (5 rows)

Name	Order	Citations	PageRank
S. Krishnan	1	391	36.25
Jiannan Wang	2	1109	45.38
Eugene Wu 0002	3	26	2.87
Michael J. Franklin	4	17423	1681.10
Ken Goldberg	5	3785	369.80
