We have a streaming pipeline with autoscaling enabled. Generally, one worker is enough to process the incoming data, but we want the number of workers to increase automatically if a backlog builds up.
Our pipeline reads from Pub/Sub and writes batches to BigQuery using load jobs every 3 minutes. We ran this pipeline starting with one worker, publishing twice as much data to Pub/Sub as one worker could consume. After 2 hours, autoscaling had still not kicked in, so the backlog would have been about 1 hour's worth of data. This seems rather poor given that autoscaling aims to keep the backlog under 10 seconds (according to this SO answer).
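For reference, here is a minimal sketch of the kind of pipeline described above (assuming the Beam Python SDK running on Dataflow; the project, subscription, and table names, the max worker count, and the JSON payload format are placeholders, not our actual setup):

    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Sketch only: project, region, names, and worker counts are placeholders.
    options = PipelineOptions([
        "--runner=DataflowRunner",
        "--project=my-project",
        "--region=us-central1",
        "--streaming",
        # Autoscaling as we currently have it enabled.
        "--autoscaling_algorithm=THROUGHPUT_BASED",
        "--max_num_workers=10",
    ])

    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/my-subscription")
            | "ParseJson" >> beam.Map(json.loads)
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table="my-project:my_dataset.my_table",
                method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                # Issue a BigQuery load job every 3 minutes (the high-latency sink).
                triggering_frequency=180,
                create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )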
The document here says that autoscaling for streaming jobs is in beta, and is known to be coarse-grained if the sinks are high-latency. And yeah, I guess doing BigQuery batches every 3 minutes counts as high-latency! Is there any progress being made on improving this autoscaling algorithm?
Are there any work-arounds we can do in the meantime, such as measuring throughput at a different point in the pipeline? I couldn't find any documentation on how the throughput gets reported to the autoscaling system.
The backlog is created by unacknowledged messages; I assume you are using pull subscriptions. If a message takes longer to process than its acknowledgement deadline allows, it will be resent, as per Pub/Sub's at-least-once delivery. The only worker able to process that message is the first one to have received it, so no new instance will be created in this case.
What you need to do is tune your system so that messages are processed before the acknowledgement deadline expires. In some situations you may benefit from using push delivery instead. I recommend reviewing this document regarding the backlog created by Pub/Sub.
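For example, with the google-cloud-pubsub Python client you could raise the acknowledgement deadline on an existing pull subscription along these lines (the project and subscription names are placeholders; 600 seconds is the maximum Pub/Sub allows):

    from google.cloud import pubsub_v1

    # Illustrative only: give a slow, batched sink more time before Pub/Sub
    # considers a message unacknowledged and redelivers it.
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-subscription")

    subscriber.update_subscription(
        request={
            "subscription": {
                "name": subscription_path,
                "ack_deadline_seconds": 600,  # maximum allowed by Pub/Sub
            },
            "update_mask": {"paths": ["ack_deadline_seconds"]},
        }
    )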