GCP Dataflow: System Lag for streaming from Pub/Sub IO

We use "System Lag" to check the health of our Dataflow jobs. For example, if we see an increase in system lag, we try to work out how to bring it back down. We have a few questions about this metric.

  • 1) What exactly does system lag mean?

The maximum time that an item of data has been awaiting processing

The above is what the GCP Console shows when we click the information icon. What does "an item of data" mean in this case? Stream processing has concepts such as windowing, event time vs. processing time, and watermarks. When is an item considered to be awaiting processing? For example, is it simply from the moment the message arrives, regardless of its state?

  • 2) What is the optimum threshold for this metric?

We try to keep this metric as low as possible, but we have no recommendation for how low it should be. For example, is there guidance along the lines of "keeping system lag between 20s and 30s is optimal"?

  • 3) How does system lag affect sinks?

How does system lag affect the latency of the event itself?

user_1357 asked Mar 01 '17 21:03


1 Answer

Depending on the pipeline being executed, there are a number of places where elements may be queued up awaiting processing. This typically happens when elements are passed between machines, such as within a GroupByKey, although the Pub/Sub source also counts: its lag reflects the oldest unacked element.

For a given step (sinks included), "System Lag" measures the age of the oldest element in the input queue closest to that step.

It is not unusual to see spikes in this measure: elements are removed from the queue only after they are processed, so if many new elements arrive at once it may take a while before the queue is back to a manageable size. What matters is that the system lag comes back down after these spikes.
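To make the definition and the spike behaviour concrete, here is a minimal toy simulation in plain Python. It is not Dataflow code: the `system_lag` and `simulate` functions, the one-second ticks, and the arrival and capacity numbers are all assumptions made up for illustration. It models a single input queue, removes elements only once they have been processed, and reports the age of the oldest element still waiting.

```python
from collections import deque


def system_lag(queue, now):
    """Age in seconds of the oldest element still waiting (0.0 if the queue is empty)."""
    return now - queue[0] if queue else 0.0


def simulate(arrivals_per_tick, capacity_per_tick):
    """Return the lag observed at the end of each one-second tick."""
    queue = deque()  # stores enqueue times; the oldest element is on the left
    lags = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        # New elements arrive at the start of the tick.
        queue.extend([float(tick)] * arrivals)
        # An element leaves the queue only after it has been processed.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
        lags.append(system_lag(queue, now=tick + 1.0))
    return lags


# Hypothetical load: a steady 2 elements/s with a burst of 10 at tick 2,
# against a worker that can process 3 elements/s.
lags = simulate([2, 2, 10, 2, 2, 2, 2, 2, 2, 2, 2, 2], capacity_per_tick=3)
print(lags)
```

In this sketch the lag climbs for a few ticks after the burst and then falls back to zero, because capacity exceeds the steady arrival rate. If capacity were below the steady rate, the lag would keep growing, which is the situation worth investigating.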

The latency of a sink depends on several factors:

  1. The rate at which elements arrive in the pipeline limits the rate at which the input watermark advances.
  2. The configuration of windowing and triggers affects how long the pipeline must wait before emitting a given window.
  3. System lag is a measure of how much added delay is currently being introduced by code executing within the pipeline.

It is likely easier to look at the "Data Watermark" of the sink, which reports the point in event time up to which the sink has processed its input.
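As a rough illustration of how the sink's data watermark can be read, the sketch below compares it with the current wall-clock time. The `watermark_lag` helper and both timestamps are hypothetical values chosen for the example; in practice the watermark would be read from the job's monitoring page rather than hard-coded.

```python
from datetime import datetime, timezone


def watermark_lag(data_watermark, now):
    """How far the sink's event-time progress trails the given wall-clock time."""
    return now - data_watermark


# Hypothetical readings: wall-clock time and the sink's data watermark.
now = datetime(2017, 3, 1, 21, 3, 0, tzinfo=timezone.utc)
data_watermark = datetime(2017, 3, 1, 21, 2, 15, tzinfo=timezone.utc)

lag = watermark_lag(data_watermark, now)
print(lag.total_seconds())
```

A gap that stays roughly constant suggests the sink is keeping up with its input, while a gap that keeps widening suggests that event-time progress is stalling somewhere upstream.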

Ben Chambers answered Oct 22 '22 14:10