I am looking for suggestions on auto-scaling Kafka brokers up and down based on load.
Let's say we have an e-commerce site, and we are capturing certain activities or events which are sent to Kafka. Since site traffic is much higher during peak hours/days, keeping a Kafka cluster with a fixed number of brokers sized for peak load is not a good idea, so we want to scale the number of brokers up when site traffic is high and back down when traffic is low.
How do people solve this kind of problem? I am not able to find any resources on this topic. Any help would be greatly appreciated.
If you are using Amazon MSK, note that its built-in auto scaling applies to broker storage rather than to the number of brokers. You also can't enable automatic scaling at cluster-creation time: you must first create the cluster and then create and enable an auto-scaling policy for it (though you can create the policy while the Amazon MSK service is still creating your cluster).
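For completeness, here is a minimal sketch of what attaching such a policy can look like with boto3 in Python. MSK storage auto scaling goes through the Application Auto Scaling service under the kafka namespace; the cluster ARN, capacity limits, and target value below are all placeholders you'd replace with your own:

```python
# Minimal sketch: attach an MSK storage auto-scaling policy to an existing
# cluster. The ARN, capacities, and target value are placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/demo/abc-123"  # placeholder

# Register the cluster's broker storage as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="kafka",
    ResourceId=cluster_arn,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    MinCapacity=100,   # current broker volume size, GiB
    MaxCapacity=1000,  # upper bound MSK may scale to, GiB
)

# Target-tracking policy: expand storage when utilization exceeds ~60%.
autoscaling.put_scaling_policy(
    PolicyName="demo-msk-storage-scaling",  # placeholder name
    ServiceNamespace="kafka",
    ResourceId=cluster_arn,
    ScalableDimension="kafka:broker-storage:VolumeSize",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "KafkaBrokerStorageUtilization"
        },
    },
)
```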
To scale the Kafka Connect side you have to increase the number of tasks (tasks.max), ensuring that there are sufficient partitions, since for a sink connector any tasks beyond the partition count just sit idle. In theory you could create topics with a very large partition count up front, but in practice over-partitioning carries real costs (more open files, more replication traffic, longer leader elections), so this is a bad idea.
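As a rough sketch of what that looks like in practice (the Connect URL and connector name below are placeholders), you can raise tasks.max through the Kafka Connect REST API. Note that PUT /connectors/<name>/config replaces the whole configuration, so you fetch the current one first:

```python
# Sketch: bump tasks.max for an existing connector via the Connect REST API.
# The worker URL and connector name are placeholders for your deployment.
import requests

connect_url = "http://localhost:8083"  # placeholder Connect worker
connector = "demo-events-sink"         # placeholder connector name

# PUT /connectors/<name>/config replaces the full config, so read it first.
config = requests.get(f"{connect_url}/connectors/{connector}/config").json()
config["tasks.max"] = "8"  # useless beyond the topic's partition count

resp = requests.put(f"{connect_url}/connectors/{connector}/config", json=config)
resp.raise_for_status()
```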
Stripped down to its core, Kafka is a distributed, horizontally scalable, fault-tolerant commit log. Horizontally scalable, however, is not the same as elastically scalable, and that distinction is the heart of your question.
Kafka doesn't really work that way. Adding or removing brokers is a very hands-on process, and it creates a lot of additional load on the cluster, so you wouldn't want the cluster scaling up and down by itself. The main reason it is so costly is that changing the broker count requires a lot of data copying across the cluster, on top of the normal traffic: when a broker is removed, all of its partition replicas must be copied somewhere else to preserve the topics' replication factor, and when a broker is added, existing partitions must be reassigned to it (e.g. with the kafka-reassign-partitions.sh tool) so that the load is actually spread across the whole cluster. All of this copying generates heavy IO, network, and CPU load, and it can be enough to cause significant problems on a busy cluster.
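To get a feel for the cost, here is a back-of-the-envelope sketch in plain Python; the broker count and per-broker data volume are made-up placeholder numbers:

```python
# Rough estimate of the copy traffic caused by adding one broker to an
# evenly balanced cluster. All inputs are hypothetical placeholder values.

def data_moved_on_scale_out(num_brokers: int, data_per_broker_gb: float) -> float:
    """Data (GB) that must be copied onto a new, empty broker so the
    cluster is evenly balanced again at num_brokers + 1 brokers."""
    total_gb = num_brokers * data_per_broker_gb
    # After rebalancing, every broker (including the new one) holds an
    # equal share of the total, and the new broker starts from zero.
    return total_gb / (num_brokers + 1)

# Example: 6 brokers, each holding 2 TB of replicated partition data.
print(f"{data_moved_on_scale_out(6, 2000):.0f} GB copied")  # ~1714 GB
```

In this made-up example, nearly 2 TB has to stream across the network before the new broker starts pulling its weight, all while the cluster is still serving its normal traffic, and removing a broker triggers a comparable copy in the other direction.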
The best way to handle this scenario is to do performance testing and optimization with 2x or even 3x the traffic you'd expect during peak hours, and to build out the cluster accordingly. That way you'll have plenty of headroom for sudden spikes, and you won't need to scale out and in at all.
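Kafka ships a kafka-producer-perf-test.sh tool for exactly this kind of test. If you'd rather drive it from code, a crude sketch with the confluent-kafka Python client might look like the following; the broker address, topic, message size, and message count are all placeholders:

```python
# Crude throughput probe: fire N messages at a test topic and measure the
# send rate. Broker address, topic, and message size are placeholders.
import time
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # placeholder
payload = b"x" * 1024  # 1 KiB messages
num_messages = 1_000_000

start = time.time()
for _ in range(num_messages):
    # Serve delivery callbacks so the client's local buffer can drain.
    producer.poll(0)
    while True:
        try:
            producer.produce("perf-test", payload)
            break
        except BufferError:
            producer.poll(0.5)  # local queue full; wait for in-flight sends
producer.flush()
elapsed = time.time() - start

print(f"{num_messages / elapsed:,.0f} msgs/s, "
      f"{num_messages * len(payload) / elapsed / 1e6:,.1f} MB/s")
```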
Kafka is extremely performant, even at millions of messages per second, so you will probably find that the cluster size your application requires is not as large or expensive as you initially thought.