Title
Data Context Informed Data Wrangling
Abstract
The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out using Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need to ensure that data wrangling steps benefit from automation wherever possible. In this paper, we define a methodology to fully automate an end-to-end data wrangling process that incorporates data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence together with data profiling paves the way to inform automation in several steps within the wrangling process: matching, mapping validation, value format transformation, and data repair. The approach is evaluated with real estate data, showing substantial improvements in the results of automated wrangling.
Year
2017
DOI
10.1109/BigData.2017.8258015
Venue
2017 IEEE International Conference on Big Data (Big Data)
Keywords
Data Wrangling, Data Context, Data Integration
DocType
Conference
ISSN
2639-1589
Citations
0
PageRank
0.34
References
0
Authors
9
Name
Order
Citations
PageRank
Martin Koehler1568.05
Alex Bogatu231.75
Cristina Civili351.77
Nikolaos Konstantinou48810.73
Edward Abel5244.85
Alvaro A. A. Fernandes6143.65
John A. Keane769592.81
Leonid Libkin8222.88
N. W. Paton915241.45