Title
Influence-based provenance for dataflow applications with taint propagation
Abstract
Debugging big data analytics often requires a root cause analysis to pinpoint the precise culprit records in an input dataset responsible for incorrect or anomalous output. Existing debugging or data provenance approaches do not track fine-grained control and data flows in user-defined application code; thus, the returned culprit data is often too large for manual inspection and expensive post-mortem analysis is required. We design FlowDebug to identify a highly precise set of input records based on two key insights. First, FlowDebug precisely tracks control and data flow within user-defined functions to propagate taints at a fine-grained level by inserting custom data abstractions through automated source-to-source transformation. Second, it introduces a novel notion of influence-based provenance for many-to-one dependencies to prioritize which input records are more responsible than others by analyzing the semantics of a user-defined function used for aggregation. By design, our approach does not require any modification to the framework's runtime and can be applied to existing applications easily. FlowDebug significantly improves the precision of debugging results by up to 99.9 percentage points and avoids repetitive re-runs required for post-mortem analysis by a factor of 33 while incurring an instrumentation overhead of 0.4X to 6.1X on vanilla Spark.
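The two ideas in the abstract can be illustrated with a minimal sketch. It is not FlowDebug's implementation or API: the `Tainted` wrapper, the `max_with_influence` function, and the sample records are all hypothetical names and data chosen for illustration, and plain Python stands in for Spark. The wrapper carries the IDs of the input records a value was derived from, and the aggregation for `max` keeps only the provenance of the operand that determines the result.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Tainted:
    """Hypothetical wrapper: a value plus the IDs of the input records behind it."""
    value: float
    prov: frozenset

    def __add__(self, other):
        # Data flow: the sum depends on both operands, so their taints are unioned.
        return Tainted(self.value + other.value, self.prov | other.prov)


def max_with_influence(a, b):
    """Many-to-one step for max: only the winning operand influences the output."""
    return a if a.value >= b.value else b


# Hypothetical input of (record_id, reading) pairs; record 3 is the anomaly.
records = [(1, 20.0), (2, 22.5), (3, 9999.0), (4, 21.0)]
tainted = [Tainted(v, frozenset({rid})) for rid, v in records]

total = reduce(lambda a, b: a + b, tainted)   # provenance: every record
peak = reduce(max_with_influence, tainted)    # provenance: the maximum's record only

print(sorted(total.prov))
print(sorted(peak.prov))
```

In this sketch, conventional provenance for the aggregation would return all four records, while the influence-aware `max` narrows the culprit set to the single record that produced the output.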
Year: 2020
DOI: 10.1145/3419111.3421292
Venue: SoCC '20: ACM Symposium on Cloud Computing, Virtual Event, USA, October 2020
DocType: Conference
ISBN: 978-1-4503-8137-6
Citations: 1
PageRank: 0.35
References: 0
Authors: 3
Name | Order | Citations | PageRank
Jason Teoh | 1 | 1 | 0.35
Muhammad Ali Gulzar | 2 | 90 | 10.07
Miryung Kim | 3 | 1856 | 82.00