Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow - Citegraph

Paper Info

Title
Watermarks in stream processing systems: semantics and comparative analysis of Apache Flink and Google Cloud Dataflow

Abstract
Streaming data processing is an exercise in taming disorder: from oftentimes huge torrents of information, we hope to extract powerful and timely analyses. But when dealing with streaming data, the unbounded and temporally disordered nature of real-world streams introduces a critical challenge: how does one reason about the completeness of a stream that never ends? In this paper, we present a comprehensive definition and analysis of watermarks, a key tool for reasoning about temporal completeness in infinite streams.

First, we describe what watermarks are and why they are important, highlighting how they address a suite of stream processing needs that are poorly served by eventually-consistent approaches:

• Computing a single correct answer, as in notifications.
• Reasoning about a lack of data, as in dip detection.
• Performing non-incremental processing over temporal subsets of an infinite stream, as in statistical anomaly detection with cubic spline models.
• Safely and punctually garbage collecting obsolete inputs and intermediate state.
• Surfacing a reliable signal of overall pipeline health.

Second, we describe, evaluate, and compare the semantically equivalent, but starkly different, watermark implementations in two modern stream processing engines: Apache Flink and Google Cloud Dataflow.

Year	2021
DOI	10.14778/3476311.3476389
Venue	Hosted Content
DocType	Journal
Volume	14
Issue	12
ISSN	2150-8097
Citations	0
PageRank	0.34
References	0
Authors	8

Authors (8 rows)

Name	Order	Citations	PageRank
Edmon Begoli	1	0	0.34
Tyler Akidau	2	1	0.72
Slava Chernyak	3	280	9.90
Fabian Hueske	4	0	0.34
Kathryn Knight	5	2	2.12
Kenneth W. Knowles	6	2	1.45
Daniel Mills	7	0	0.34
Dan Sotolongo	8	0	0.34
