Title
Context-aware data quality assessment for big data.
Abstract
Big data has changed the way we collect and analyze data. In particular, the amount of available information is constantly growing, and organizations rely more and more on data analysis to achieve competitive advantage. However, such an amount of data can create real value only if combined with quality: good decisions and actions are the result of correct, reliable, and complete data. In such a scenario, methods and techniques for Data Quality assessment can support the identification of suitable data to process. While numerous assessment methods have been proposed for traditional databases, in the Big Data scenario new algorithms have to be designed to deal with novel requirements related to variety, volume, and velocity. In particular, in this paper we highlight that dealing with heterogeneous sources requires an adaptive approach able to trigger the suitable quality assessment methods on the basis of the data type and the context in which the data are to be used. Furthermore, we show that in some situations it is not possible to evaluate the quality of the entire dataset due to performance and time constraints. For this reason, we suggest focusing the Data Quality assessment on only a portion of the dataset and accounting for the consequent loss of accuracy by introducing a confidence factor as a measure of the reliability of the quality assessment procedure. We propose a methodology to build a Data Quality adapter module, which selects the best configuration for the Data Quality assessment based on the user's main requirements: time minimization, confidence maximization, and budget minimization. Experiments are performed on real data gathered from a smart city case study.
Year
2018
DOI
10.1016/j.future.2018.07.014
Venue
Future Generation Computer Systems
Keywords
00-01,99-00
Field
Data mining, Data quality, Computer science, Competitive advantage, Adapter (computing), Data type, Minification, Smart city, Big data, Maximization, Distributed computing
DocType
Journal
Volume
89
ISSN
0167-739X
Citations
1
PageRank
0.37
References
16
Authors
4
Name              Order  Citations  PageRank
D. Ardagna        1      29         5.80
Cinzia Cappiello  2      978        64.35
Walter Samá       3      3          0.74
Monica Vitali     4      64         11.82