Title | ||
---|---|---|
Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow |
Abstract | ||
---|---|---|
Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams. First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches: • Computing a single correct answer, as in notifications. • Reasoning about a lack of data, as in dip detection. • Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models. • Safely and punctually garbage collecting obsolete inputs and intermediate state. • Surfacing a reliable signal of overall pipeline health. Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow. |
Year | DOI | Venue |
---|---|---|
2021 | 10.14778/3476311.3476389 | Hosted Content |
DocType | Volume | Issue |
Journal | 14 | 12 |
ISSN | Citations | PageRank |
2150-8097 | 0 | 0.34 |
References | Authors | |
0 | 8 |
Name | Order | Citations | PageRank |
---|---|---|---|
Edmon Begoli | 1 | 0 | 0.34 |
Tyler Akidau | 2 | 1 | 0.72 |
Slava Chernyak | 3 | 280 | 9.90 |
Fabian Hueske | 4 | 0 | 0.34 |
Kathryn Knight | 5 | 2 | 2.12 |
Kenneth W. Knowles | 6 | 2 | 1.45 |
Daniel Mills | 7 | 0 | 0.34 |
Dan Sotolongo | 8 | 0 | 0.34 |