Title
Models for Distributed, Large Scale Data Cleaning.
Abstract
Poor data quality is a serious and costly problem affecting organizations across all industries. Real data is often dirty, containing missing, erroneous, incomplete, and duplicate values. Declarative data cleaning techniques have been proposed to resolve some of these underlying errors by identifying the inconsistencies and proposing updates to the data. However, much of this work has focused on cleaning data in static environments. In the Big Data era, modern applications operate in dynamic data environments where large-scale data may change frequently. For example, consider data in sensor environments where there is a frequent stream of data arrivals, or financial data of stock prices and trading volumes. Data cleaning in such dynamic environments requires understanding the properties of the incoming data streams and configuring system parameters to maximize performance and improve data quality. In this paper, we present a set of queueing models and analyze the impact of various system parameters on the output quality and performance of a data cleaning system. We assume random routing in our models, and consider a variety of system configurations that reflect potential data cleaning scenarios. We present experimental results showing that our models are able to closely predict expected system behaviour.
Year: 2014
DOI: 10.1007/978-3-319-13186-3_34
Venue: Lecture Notes in Artificial Intelligence
Keywords: Data quality, Distributed data cleaning, Queueing models
Field: Data mining, Data stream mining, Data quality, Random routing, Computer science, Queueing theory, Dynamic data, Big data
DocType: Conference
Volume: 8643
ISSN: 0302-9743
Citations: 1
PageRank: 0.34
References: 10
Authors: 3
Name               Order   Citations   PageRank
Vincent Maccio     1       36          3.52
Fei Chiang         2       256         19.02
Douglas G. Down    3       370         37.04