I have been reading the Dataflow SDK documentation, trying to find out what happens when data arrives past the watermark in a streaming job.
This page:
https://cloud.google.com/dataflow/model/windowing
indicates that if you use the default window/trigger strategies, then late data will be discarded:
Note: Dataflow's default windowing and trigger strategies discard late data. If you want to ensure that your pipeline handles instances of late data, you'll need to explicitly set .withAllowedLateness when you set your PCollection's windowing strategy and set triggers for your PCollections accordingly.
Yet this page:
https://cloud.google.com/dataflow/model/triggers
indicates that late data will be emitted as a single element PCollection when it arrives late:
The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window. The default trigger emits on a repeating basis, meaning that any late data will by definition arrive after the watermark and trip the trigger, causing the late elements to be emitted as they arrive.
So, will late data past the watermark be discarded completely? Or, will it only not be emitted with the other data it would have been windowed with had it arrived in time, and be emitted on its own instead?
The default "windowing and trigger strategies" discard late data. The WindowingStrategy
is an object which consists of windowing, triggering, and a few other parameters such as allowed lateness. The default allowed lateness is 0, so any late data elements are discarded.
The default trigger handles late data. If you take the default WindowingStrategy
and change only the allowed lateness, then you will receive a PCollection
which contains one output pane for all of the on time data, and then a new output pane for approximately every late element.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With