Title
Pitfalls Analyzer: Quality Control for Model-Driven Data Science Pipelines
Abstract
A data science pipeline is a sequence of data processing steps that aims to derive knowledge and insights from raw data. Data science pipeline tools simplify the creation and automation of such pipelines by providing reusable building blocks that users can drag and drop into their pipelines. This graphical, model-driven approach enables users with limited data science expertise to create complex pipelines. However, recent studies show that several data science pitfalls can yield spurious results and, consequently, misleading insights. Yet none of the popular pipeline tools has built-in quality control measures to detect these pitfalls. Therefore, in this paper, we propose an approach called Pitfalls Analyzer to detect common pitfalls in data science pipelines. As a proof of concept, we implemented a prototype of the Pitfalls Analyzer for KNIME, which is one of the most popular data science pipeline tools. Our prototype is model-driven, since the detection of pitfalls is accomplished using pipelines that were created with KNIME building blocks. To showcase the effectiveness of our approach, we ran our prototype on 11 pipelines that were created by KNIME experts for 3 Internet-of-Things (IoT) projects. The results indicate that our prototype flags all and only those instances of the pitfalls that we were able to flag while manually inspecting the pipelines.
Year
2019
DOI
10.1109/MODELS.2019.00-19
Venue
2019 ACM/IEEE 22nd International Conference on Model Driven Engineering Languages and Systems (MODELS)
Keywords
Data science pipelines, model-driven engineering, quality control, data science pitfalls
Field
Pipeline transport, Systems engineering, Computer science, Spectrum analyzer
DocType
Conference
ISBN
978-1-7281-2537-4
Citations
1
PageRank
0.36
References
0
Authors
4
Name                        Order  Citations  PageRank
Gopi Krishnan Rajbahadur    1      10         1.49
Gustavo Ansaldi Oliva       2      5          1.47
Ahmed E. Hassan             3      5959       287.68
Juergen Dingel              4      608        49.06