Title
Diagnosing Machine Learning Pipelines with Fine-grained Lineage
Abstract
We present Hippo, a system that enables the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets, and the cell-level mapping between them, along with the information needed to reproduce the computation. Hippo efficiently supports common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting metadata separation and high-order function encoding, we observe an O(10^3)x total improvement in lineage storage efficiency over the baseline of recording cell-wise mappings, while maintaining lineage integrity. Hippo answers lineage queries from real use cases within a few seconds, which is fast enough to enable interactive diagnosis of ML pipelines.
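The abstract does not show Hippo's API, so the following is only an illustrative sketch of the general idea behind encoding lineage as a function instead of recording every cell-to-cell mapping. All names here (`FunctionLineage`, `row_dependency`, `build_cellwise_mapping`) are hypothetical and are not taken from Hippo. The example assumes a transformation in which each output cell depends on every input cell in the same row, such as a row-wise normalization.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, int]   # (row, column)
Shape = Tuple[int, int]  # (rows, columns)


def build_cellwise_mapping(shape: Shape) -> Dict[Cell, List[Cell]]:
    """Baseline: store the input cells of every output cell explicitly."""
    rows, cols = shape
    return {
        (r, c): [(r, k) for k in range(cols)]
        for r in range(rows)
        for c in range(cols)
    }


def row_dependency(out_cell: Cell, input_shape: Shape) -> List[Cell]:
    """Rule form of the same dependency: an output cell maps to its whole input row."""
    row, _ = out_cell
    return [(row, k) for k in range(input_shape[1])]


@dataclass(frozen=True)
class FunctionLineage:
    """Hypothetical lineage record: dataset metadata kept apart from the mapping rule."""
    input_name: str
    output_name: str
    input_shape: Shape
    output_shape: Shape
    backward: Callable[[Cell, Shape], List[Cell]]

    def inputs_of(self, out_cell: Cell) -> List[Cell]:
        """Answer a backward lineage query by evaluating the stored rule."""
        r, c = out_cell
        rows, cols = self.output_shape
        if not (0 <= r < rows and 0 <= c < cols):
            raise IndexError(f"cell {out_cell} is outside shape {self.output_shape}")
        return self.backward(out_cell, self.input_shape)


shape = (3, 4)
cellwise = build_cellwise_mapping(shape)
encoded = FunctionLineage("raw", "normalized", shape, shape, row_dependency)

# Both representations answer the same query identically.
assert encoded.inputs_of((1, 2)) == cellwise[(1, 2)]

# The explicit mapping grows with the data; the rule does not.
stored_pairs = sum(len(inputs) for inputs in cellwise.values())
print(stored_pairs)  # rows * cols * cols explicit cell pairs
```

In this sketch the explicit mapping holds rows × cols × cols cell pairs, while the function-encoded record stores only the two shapes and one rule regardless of dataset size. The storage improvement reported in the abstract comes from the authors' own measurements and is not reproduced by this toy example.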
Year
2017
DOI
10.1145/3078597.3078603
Venue
HPDC
Field
Computer science, Data lineage, Real-time computing, Artificial intelligence, Distributed computing, Computation, Metadata, Pipeline transport, Parallel computing, Storage efficiency, Machine learning, Debugging, Encoding (memory)
DocType
Conference
Citations
4
PageRank
0.39
References
28
Authors
3
Name                 Order  Citations  PageRank
Zhao Zhang           1      177        10.37
Evan R. Sparks       2      397        18.79
Michael J. Franklin  3      17423      1681.10