Title
Cleaning uncertain data with a noisy crowd
Abstract
Uncertain data has emerged as an important problem in database systems because many applications are inherently imprecise. To handle the uncertainty, probabilistic databases can store uncertain data and provide querying facilities that yield answers with confidence. However, the uncertainty may propagate, so the results returned by a query or mining process may not be useful. In this paper, we leverage the power of crowdsourcing to clean uncertain data. Specifically, we design a set of Human Intelligence Tasks (HITs) that ask a crowd to improve the quality of uncertain data. Each HIT is associated with a cost, so we need solutions that maximize data quality with a minimal number of HITs. Two obstacles make this optimization non-trivial: first, the crowd may return incorrect answers with some probability; second, the HITs decomposed from uncertain data are often correlated. Together, these obstacles make selecting the optimal set of HITs computationally very expensive. We address these challenges by designing an effective approximation algorithm and an efficient heuristic solution. To further improve efficiency, we derive tight lower and upper bounds, which are used for effective filtering and estimation. We verify the solutions with extensive experiments on both a simulated crowd and a real crowdsourcing platform.
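The abstract does not spell out the selection algorithm, so the following is only an illustrative sketch of the underlying trade-off, not the authors' method. It assumes each uncertain fact is an independent yes/no question with prior probability `p`, that the crowd answers correctly with a fixed probability `accuracy`, and that data quality is measured by Shannon entropy (a choice suggested by the keyword list, not confirmed by the abstract). All function names are hypothetical. Under these assumptions, a greedy rule picks the HITs with the largest expected entropy reduction.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a yes/no fact that is true with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def posterior(p, accuracy, answer_yes):
    """Bayes update of P(fact is true) after one noisy crowd answer."""
    like_true = accuracy if answer_yes else 1 - accuracy
    like_false = 1 - accuracy if answer_yes else accuracy
    evidence = p * like_true + (1 - p) * like_false
    return p * like_true / evidence

def expected_entropy_after_hit(p, accuracy):
    """Expected remaining entropy after asking the crowd once about this fact."""
    p_yes = p * accuracy + (1 - p) * (1 - accuracy)
    remaining = 0.0
    for answer_yes, p_answer in ((True, p_yes), (False, 1 - p_yes)):
        if p_answer > 0:  # skip answers that cannot occur
            remaining += p_answer * entropy(posterior(p, accuracy, answer_yes))
    return remaining

def pick_hits(priors, accuracy, budget):
    """Greedy choice: the `budget` facts with the largest expected entropy reduction.

    Assumes the facts are independent, which the abstract says is often NOT
    the case for real HITs.
    """
    gains = {
        name: entropy(p) - expected_entropy_after_hit(p, accuracy)
        for name, p in priors.items()
    }
    return sorted(gains, key=gains.get, reverse=True)[:budget]

chosen = pick_hits({"a": 0.5, "b": 0.99, "c": 0.6}, accuracy=0.8, budget=2)
```

The independence assumption is what keeps this sketch cheap: once HITs are correlated, as the abstract notes, the gain of one HIT depends on which others are asked, and the selection problem becomes far more expensive.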
Year
2015
DOI
10.1109/ICDE.2015.7113268
Venue
Data Engineering
Keywords
approximation theory,data mining,database management systems,optimisation,query processing,hit,approximation algorithm,crowdsourcing platform,database system,human intelligence tasks,mining process,noisy crowd,nontrivial optimization,probabilistic database,query process,uncertain data cleaning,crowdsourcing,uncertainty,semantics,accuracy,entropy
Field
Approximation algorithm,Data mining,Heuristic,Ask price,Data quality,Crowdsourcing,Computer science,Filter (signal processing),Uncertain data,Probabilistic logic,Database
DocType
Conference
ISSN
1084-4627
Citations
15
PageRank
0.49
References
29
Authors
4
Name, Order, Citations, PageRank
Chen Jason Zhang, 1, 161, 8.28
Lei Chen, 2, 6239, 395.84
Yongxin Tong, 3, 1095, 56.54
Zheng Liu, 4, 84, 10.09