Title
Learn2Clean: Optimizing the Sequence of Tasks for Web Data Preparation
Abstract
Data cleaning and preparation have been long-standing challenges in data science, needed to avoid incorrect results and misleading conclusions obtained from dirty data. For a given dataset and a given machine learning-based task, the plethora of data preprocessing techniques and alternative data curation strategies may lead to dramatically different outputs of unequal quality. Most current work on data cleaning and automated machine learning, however, focuses on developing either cleaning algorithms or user-guided systems, or argues for relying on a principled method to select the sequence of data preprocessing steps that can lead to the optimal quality performance. In this paper, we propose Learn2Clean, a method based on Q-Learning, a model-free reinforcement learning technique, that selects, for a given dataset, an ML model, and a quality performance metric, the optimal sequence of tasks for preprocessing the data such that the quality of the ML model result is maximized. As a preliminary validation of our approach in the context of Web data analytics, we present some promising results on data preparation for clustering, regression, and classification on real-world data.
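To make the idea in the abstract concrete, here is a minimal sketch of tabular Q-learning choosing an order of preprocessing steps. It is not the paper's implementation: the action names, the `toy_quality` scoring rule, and all hyperparameters are invented for illustration, and a real setup would replace `toy_quality` with the chosen quality metric of an ML model trained on the preprocessed data.

```python
import random

# Hypothetical preprocessing actions (assumed; the abstract does not list them).
ACTIONS = ["normalize", "dedup", "impute", "stop"]

def toy_quality(seq):
    """Made-up stand-in for the ML quality metric; depends on step order."""
    score = 0.0
    if "impute" in seq:
        score += 0.3
    if "dedup" in seq:
        score += 0.2
    if "normalize" in seq:
        # Invented rule: normalizing helps more once missing values are imputed.
        before = seq[:seq.index("normalize")]
        score += 0.4 if "impute" in before else 0.1
    return score

def available(state):
    # Each step may be applied once; "stop" is always allowed.
    return [a for a in ACTIONS if a not in state]

def q_learn(episodes=3000, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {}  # (state, action) -> estimated return; state = tuple of applied steps
    for _ in range(episodes):
        state = ()
        while True:
            choices = available(state)
            if rng.random() < epsilon:          # explore
                action = rng.choice(choices)
            else:                               # exploit current estimates
                action = max(choices, key=lambda a: q.get((state, a), 0.0))
            old = q.get((state, action), 0.0)
            if action == "stop":
                # Reward is only observed when the pipeline is evaluated.
                q[(state, action)] = old + alpha * (toy_quality(state) - old)
                break
            nxt = state + (action,)
            best_next = max(q.get((nxt, a), 0.0) for a in available(nxt))
            q[(state, action)] = old + alpha * (gamma * best_next - old)
            state = nxt
    return q

def best_sequence(q):
    """Follow the greedy policy from the empty pipeline until 'stop'."""
    state = ()
    while True:
        action = max(available(state), key=lambda a: q.get((state, a), 0.0))
        if action == "stop":
            return state
        state = state + (action,)

learned = best_sequence(q_learn())
print(learned, toy_quality(learned))
```

In this toy setting the reward arrives only at the end of an episode, so the value of an early choice (for example imputing before normalizing) is learned by propagating the final score back through the Q-table.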
Year
2019
DOI
10.1145/3308558.3313602
Venue
WWW '19: The World Wide Web Conference
Keywords
Data cleaning, Principled data preprocessing, Q-Learning, Reinforcement learning
Field
Data mining, Data analysis, Computer science, Performance metric, Data pre-processing, Q-learning, Data curation, Dirty data, Artificial intelligence, Cluster analysis, Machine learning, Reinforcement learning
DocType
Conference
ISBN
978-1-4503-6674-8
Citations
0
PageRank
0.34
References
0
Authors
1
Name
Laure Berti-Equille
Order
1
Citations
588
PageRank
49.90