Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Google Dataflow late data

Tags:

I have been reading the Dataflow SDK documentation, trying to find out what happens when data arrives past the watermark in a streaming job.

This page:

https://cloud.google.com/dataflow/model/windowing

indicates that if you use the default window/trigger strategies, then late data will be discarded:

Note: Dataflow's default windowing and trigger strategies discard late data. If you want to ensure that your pipeline handles instances of late data, you'll need to explicitly set .withAllowedLateness when you set your PCollection's windowing strategy and set triggers for your PCollections accordingly.

Yet this page:

https://cloud.google.com/dataflow/model/triggers

indicates that late data will be emitted as a single element PCollection when it arrives late:

The default trigger for a PCollection is event time-based, and emits the results of the window when the system's watermark (Dataflow's notion of when it "should" have all the data) passes the end of the window. The default trigger emits on a repeating basis, meaning that any late data will by definition arrive after the watermark and trip the trigger, causing the late elements to be emitted as they arrive.

So, will late data past the watermark be discarded completely? Or, will it only not be emitted with the other data it would have been windowed with had it arrived in time, and be emitted on its own instead?

like image 842
LeeW Avatar asked May 16 '16 04:05

LeeW


1 Answers

The default "windowing and trigger strategies" discard late data. The WindowingStrategy is an object which consists of windowing, triggering, and a few other parameters such as allowed lateness. The default allowed lateness is 0, so any late data elements are discarded.

The default trigger handles late data. If you take the default WindowingStrategy and change only the allowed lateness, then you will receive a PCollection which contains one output pane for all of the on time data, and then a new output pane for approximately every late element.

like image 54
Ben Chambers Avatar answered Oct 13 '22 00:10

Ben Chambers