SampleClean: Fast and Reliable Analytics on Dirty Data. - Citegraph

Paper Info

Title
SampleClean: Fast and Reliable Analytics on Dirty Data.

Abstract
An important obstacle to accurate data analytics is dirty data in the form of missing, duplicate, incorrect, or inconsistent values. In the SampleClean project, we have developed a new suite of techniques to estimate the results of queries when only a sample of data can be cleaned. Some forms of data corruption, such as duplication, can affect sampling probabilities, and thus, new techniques have to be designed to ensure correctness of the approximate query results. We first describe our initial project on computing statistically bounded estimates of sum, count, and avg queries from samples of cleaned data. We subsequently explored how the same techniques could apply to other problems in database research, namely, materialized view maintenance. To avoid expensive incremental maintenance, we maintain only a sample of rows in a view, and then leverage SampleClean to approximate aggregate query results. Finally, we describe our work on a gradient-descent algorithm that extends the key ideas to the increasingly common Machine Learning-based analytics.
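To illustrate the idea of answering avg and sum queries from a cleaned sample, here is a minimal sketch. It is not the paper's estimator: it assumes uniform sampling without replacement, a hypothetical per-record cleaner (`clean_value`), and a normal-approximation 95% interval. It also ignores the duplication issue the abstract mentions, where duplicates change sampling probabilities. All function names are illustrative.

```python
import math
import random
import statistics


def clean_value(v):
    """Hypothetical cleaner: treat negative readings as errors and clamp them to 0."""
    return max(v, 0.0)


def estimate_avg_and_sum(dirty, sample_size, seed=0):
    """Estimate avg and sum of the cleaned data from a cleaned uniform sample.

    Returns {"avg": (estimate, half_width), "sum": (estimate, half_width)},
    where half_width is an approximate 95% confidence half-width (CLT-based,
    no finite-population correction).
    """
    rng = random.Random(seed)
    sample = rng.sample(dirty, sample_size)      # uniform, without replacement
    cleaned = [clean_value(v) for v in sample]   # only the sample is cleaned

    mean = statistics.fmean(cleaned)
    stderr = statistics.stdev(cleaned) / math.sqrt(sample_size)
    half = 1.96 * stderr                         # ~95% under a normal approximation

    n = len(dirty)
    # Scale the sample mean by the population size to estimate the sum.
    return {"avg": (mean, half), "sum": (n * mean, n * half)}


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic data: mostly valid readings, with some negative (dirty) values.
    data = [rng.uniform(0, 100) if rng.random() > 0.1 else -1.0 for _ in range(10_000)]
    print(estimate_avg_and_sum(data, sample_size=500))
```

The design point is that cleaning cost scales with the sample size rather than the full dataset, while the interval width shrinks roughly with the square root of the sample size.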

Year	2015
Venue	IEEE Data Eng. Bull.
Field	Row, Data mining, Data analysis, Computer science, Correctness, Sampling (statistics), Data Corruption, Dirty data, Analytics, Materialized view
DocType	Journal
Volume	38
Issue	3
Citations	10
PageRank	0.58
References	25
Authors	7

Authors (7 rows)

Name	Order	Citations	PageRank
S. Krishnan	1	391	36.25
Jiannan Wang	2	1109	45.38
Michael J. Franklin	3	17423	1681.10
Ken Goldberg	4	3785	369.80
Tim Kraska	5	2226	133.57
Tova Milo	6	4074	1052.72
Eugene Wu 0002	7	26	2.87
