Autoscaling in Google Cloud Dataflow

We have a streaming pipeline with autoscaling enabled. Generally, one worker is enough to process the incoming data, but we want the number of workers to increase automatically if a backlog builds up.

Our pipeline reads from Pub/Sub and writes batches to BigQuery using load jobs every 3 minutes. We ran this pipeline starting with one worker, publishing twice as much data to Pub/Sub as one worker could consume. After 2 hours, autoscaling had still not kicked in, so the backlog would have been about an hour's worth of data. This seems rather poor given that autoscaling aims to keep the backlog under 10 seconds (according to this SO answer).
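
For concreteness, here is a minimal sketch of a pipeline shaped like ours, written with the Beam Python SDK (the project, subscription, table, and schema names are placeholders, not our real setup):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder options; THROUGHPUT_BASED enables autoscaling on Dataflow,
# capped at max_num_workers.
options = PipelineOptions(
    streaming=True,
    runner="DataflowRunner",
    autoscaling_algorithm="THROUGHPUT_BASED",
    max_num_workers=10,
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/my-sub")
        | "Parse" >> beam.Map(lambda msg: {"data": msg.decode("utf-8")})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="data:STRING",
            # Batch into BigQuery load jobs every 3 minutes -- the
            # high-latency sink described above.
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
            triggering_frequency=180,
        )
    )
```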

The document here says that autoscaling for streaming jobs is in beta, and is known to be coarse-grained if the sinks are high-latency. And yeah, I guess doing BigQuery batches every 3 minutes counts as high-latency! Is there any progress being made on improving this autoscaling algorithm?

Are there any workarounds we can apply in the meantime, such as measuring throughput at a different point in the pipeline? I couldn't find any documentation on how throughput gets reported to the autoscaling system.

Chris Heath asked Jun 28 '18 17:06


1 Answer

The backlog is created by unacknowledged messages; I guess that you are using a pull subscription. If a message takes longer to process than to acknowledge, it will be resent, per Pub/Sub's at-least-once delivery. And the only worker able to process this message is the first one to have received it, so no new instance will be created in this case.
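
To illustrate the mechanism, here is a minimal pull-subscriber sketch with the google-cloud-pubsub client (the project and subscription names are placeholders, and process() stands in for whatever work your pipeline does):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")

def process(data: bytes) -> None:
    # Placeholder for the actual per-message work.
    pass

def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    # If the message is not acked before its deadline (and the lease
    # is not extended), Pub/Sub redelivers it under at-least-once
    # delivery, which keeps the backlog metric high.
    process(message.data)
    message.ack()

future = subscriber.subscribe(subscription_path, callback=callback)
future.result()
```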

What you need to do is tune your system to process messages before the acknowledgement deadline expires. You may benefit from using push subscriptions in some situations. I recommend reviewing this document regarding the backlog created by Pub/Sub.
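
As one illustration of that tuning, you can raise the subscription's acknowledgement deadline so workers get more time before a message is considered unacknowledged. A sketch with the google-cloud-pubsub client (project and subscription names are placeholders):

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-sub")

# Raise the ack deadline to 600 seconds (the maximum) so messages
# are not redelivered while a worker is still processing them.
subscriber.update_subscription(
    request={
        "subscription": {
            "name": subscription_path,
            "ack_deadline_seconds": 600,
        },
        "update_mask": {"paths": ["ack_deadline_seconds"]},
    }
)
```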

Nathan Nasser answered Jan 03 '23 02:01