Title
Determining the Real Data Completeness of a Relational Dataset
Abstract
Low data quality is a serious problem in the era of big data: it can severely reduce the usability of data, mislead or bias querying, analysis, and mining, and lead to huge losses. Incomplete data is common in low-quality data, so it is necessary to determine the completeness of a dataset to guide follow-up operations on it. Little existing work focuses on the completeness of a dataset, and such work treats all missing values as unknown. In this paper, we study how to determine the real data completeness of a relational dataset. By taking advantage of given functional dependencies, we aim to infer some missing attribute values from other tuples and to identify the attribute cells that are truly missing. We propose a data completeness model, formalize the problem of determining the real data completeness of a relational dataset, and give a lower bound on the time complexity of this problem. We propose two optimal algorithms that determine the data completeness of a dataset in different cases. We empirically show the effectiveness and scalability of our algorithms on both real-world and synthetic data.
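The idea of recovering missing cells through functional dependencies can be illustrated with a minimal sketch. This is not one of the paper's two algorithms; it is a naive fixpoint loop written only to show the distinction between raw completeness and "real" completeness. The names `fill_by_fds` and `completeness`, the toy relation, and the dependencies zip → city and city → state are all assumptions made for the example, and completeness is assumed here to be the fraction of non-missing cells.

```python
from typing import Optional

Row = dict[str, Optional[str]]
FD = tuple[tuple[str, ...], str]  # (left-hand attributes, right-hand attribute)


def fill_by_fds(rows: list[Row], fds: list[FD]) -> list[Row]:
    """Return a copy of rows with missing cells inferred from other tuples."""
    filled = [dict(r) for r in rows]
    changed = True
    while changed:  # repeat until no dependency can infer anything new
        changed = False
        for lhs, rhs in fds:
            # Map each fully known left-hand side to a known right-hand value.
            known = {}
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if None not in key and r[rhs] is not None:
                    known.setdefault(key, r[rhs])
            # Fill missing right-hand cells whose left-hand side matches.
            for r in filled:
                key = tuple(r[a] for a in lhs)
                if r[rhs] is None and key in known:
                    r[rhs] = known[key]
                    changed = True
    return filled


def completeness(rows: list[Row]) -> float:
    """Fraction of cells that are not missing (assumed measure)."""
    total = sum(len(r) for r in rows)
    present = sum(v is not None for r in rows for v in r.values())
    return present / total if total else 1.0


# Hypothetical toy relation; None marks a missing cell.
rows = [
    {"zip": "10001", "city": "New York", "state": "NY"},
    {"zip": "10001", "city": None, "state": None},
    {"zip": None, "city": "Boston", "state": None},
]
fds = [(("zip",), "city"), (("city",), "state")]

filled = fill_by_fds(rows, fds)
raw = completeness(rows)      # counts every missing cell as unknown
real = completeness(filled)   # counts only cells that no dependency can recover
```

In this toy relation, the second tuple's city and state are recoverable from the first tuple, while the third tuple's zip and state remain truly missing, so the two measures differ.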
Year: 2016
DOI: 10.1007/s11390-016-1659-x
Venue: J. Comput. Sci. Technol.
Keywords: data quality, data completeness, functional dependency, data completeness model, optimal algorithm
Field: Data mining, Data quality, Computer science, Tuple, Synthetic data, Missing data, Time complexity, Completeness (statistics), Big data, Scalability
DocType: Journal
Volume: 31
Issue: 4
ISSN: 1860-4749
Citations: 1
PageRank: 0.35
References: 21
Authors (3)
Name | Order | Citations | PageRank
Yongnan Liu | 1 | 1 | 0.35
Jianzhong Li | 2 | 3196 | 304.46
Zhaonian Zou | 3 | 331 | 15.78