 

How to choose the number of partitions for a Kafka topic?

We have a 3-node ZooKeeper cluster and 7 brokers. Now we have to create a topic and decide on the number of partitions for it.

But I could not find any formula for deciding how many partitions I should create for this topic. The producer rate is 5k messages/sec and each message is 130 bytes.

Thanks In Advance

Rajendra Jangir asked May 10 '18

People also ask

How many partitions should I have in Kafka topic?

A Kafka cluster should have a maximum of 200,000 partitions across all brokers when managed by Zookeeper. The reason is that if brokers go down, Zookeeper needs to perform a lot of leader elections. Confluent still recommends up to 4,000 partitions per broker in your cluster.

How do I choose the number of partitions?

The best way to decide on the number of partitions in an RDD is to make the number of partitions equal to the number of cores in the cluster so that all the partitions will process in parallel and the resources will be utilized in an optimal way.

What is the maximum number of partitions Kafka?

Kafka itself does not enforce a hard per-topic limit; a topic can have many partitions across which records are distributed. In practice, the ZooKeeper-based guidance above (roughly 200,000 partitions per cluster and about 4,000 per broker) is the practical ceiling.

Can I increase number of partitions Kafka?

You can increase (but not decrease) the partition count of an existing topic in place, though records already written stay on their original partitions. If you want all messages redistributed, or you need to change the number of replicas, you can use a streaming transformation to automatically stream all of the messages from the original topic into a new Kafka topic that has the desired number of partitions or replicas.
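For the streaming-copy approach, here is a minimal sketch using the confluent_kafka Python client; the broker address, the topic names ("old-topic", "new-topic"), the partition count, and the consumer group id are all illustrative assumptions, not values from the question.

```python
# Minimal sketch of the streaming-copy approach: create a new topic with the
# desired partition count, then re-produce every record from the old topic
# into it. Broker address, topic names, counts and group id are assumptions.
from confluent_kafka import Consumer, Producer
from confluent_kafka.admin import AdminClient, NewTopic

BOOTSTRAP = "localhost:9092"  # placeholder broker address

admin = AdminClient({"bootstrap.servers": BOOTSTRAP})
# Create the target topic with the desired partition/replica counts up front.
futures = admin.create_topics(
    [NewTopic("new-topic", num_partitions=50, replication_factor=3)]
)
for future in futures.values():
    future.result()  # block until the topic exists (raises on failure)

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "repartition-copy",    # hypothetical group id
    "auto.offset.reset": "earliest",   # start from the beginning of old-topic
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})
consumer.subscribe(["old-topic"])

while True:
    msg = consumer.poll(5.0)
    if msg is None:
        break                          # caught up: nothing new for 5 seconds
    if msg.error():
        continue                       # skip errors in this sketch
    # Re-produce with the same key so related records still hash together
    # (though they will generally land on different partition numbers).
    producer.produce("new-topic", key=msg.key(), value=msg.value())

producer.flush()
consumer.close()
```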


2 Answers

I can't give you a definitive answer, there are many patterns and constraints that can affect the answer, but here are some of the things you might want to take into account:

  • The unit of parallelism is the partition, so if you know the average processing time per message, then you can calculate the number of partitions required to keep up. For example, if each message takes 100ms to process and you receive 5k messages a second, then one consumer per partition can handle 10 messages a second, so you'll need at least 500 partitions (5,000 / 10). Add a percentage on top of that to cope with peaks and variable infrastructure performance; queueing theory can give you the math to calculate your parallelism needs (see the first sketch after this list).

  • How bursty is your traffic and what latency constraints do you have? Considering the last point, if you also have latency requirements then you may need to scale out your partitions to cope with your peak rate of traffic.

  • If you use any data-locality patterns or require message ordering, then you also need to consider future traffic growth. For example, say you deal with customer data, use the customer id as the partition key, and depend on each customer always being routed to the same partition, perhaps for event sourcing or simply to ensure each change is applied in the right order. If you later add partitions to cope with a higher rate of messages, most customers will be routed to a different partition than before. This introduces headaches for guaranteed message ordering, because a customer's messages now exist on two partitions (see the second sketch after this list). So create enough partitions for future growth up front. Consumers are easy to scale out and in, but partitions need some planning, so err on the safe side and be future proof.

  • Having thousands of partitions can increase overall latency.
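To make the parallelism arithmetic in the first bullet concrete, here is a small back-of-the-envelope sketch that sizes the partition count from the numbers in the question and then creates the topic with the confluent_kafka AdminClient; the 100 ms processing time, the 30% headroom, the topic name and the broker address are illustrative assumptions, not recommendations.

```python
# Back-of-the-envelope partition sizing from the figures in the question.
# The 100 ms processing time and 30% headroom are assumptions for illustration.
import math
from confluent_kafka.admin import AdminClient, NewTopic

msg_rate = 5_000           # messages per second (from the question)
processing_time = 0.100    # seconds per message on the consumer (assumed)
headroom = 0.30            # extra capacity for bursts / slow nodes (assumed)

# Each partition is consumed by at most one consumer in a group, and that
# consumer can handle 1 / processing_time messages per second.
per_partition_throughput = 1 / processing_time               # 10 msg/s
partitions = math.ceil(msg_rate / per_partition_throughput * (1 + headroom))
print(partitions)                                            # 650 here

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # placeholder
futures = admin.create_topics(
    [NewTopic("my-topic", num_partitions=partitions, replication_factor=3)]
)
for future in futures.values():
    future.result()  # raises if topic creation failed
```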
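And to illustrate the ordering concern in the partition-key bullet: Kafka's default partitioner hashes the record key modulo the partition count, so adding partitions changes where a given key lands. A tiny sketch, using Python's built-in hash() as a stand-in for Kafka's murmur2 hash:

```python
# Why adding partitions breaks key -> partition stability: the default
# partitioner hashes the key modulo the partition count. Python's hash()
# stands in here for Kafka's murmur2, just to show the effect.
def partition_for(key: str, num_partitions: int) -> int:
    return hash(key) % num_partitions

for customer_id in ["cust-1", "cust-2", "cust-3"]:
    before = partition_for(customer_id, 50)   # original partition count
    after = partition_for(customer_id, 60)    # after adding 10 partitions
    print(customer_id, before, after)         # same key, often a new partition
```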

Vanlightly answered Sep 25 '22

This old benchmark by a Kafka co-founder is useful for understanding the magnitudes of scale involved: https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines

The immediate conclusion from this, as Vanlightly said above, is that consumer handling time is the most important factor in deciding the number of partitions, since at 5k messages/sec of 130-byte messages you are nowhere near challenging producer throughput.

The maximum concurrency for consuming is the number of partitions, so you want to make sure that:

((processing time for one message in seconds x number of msgs per second) / num of partitions) << 1

If the ratio equals 1, you cannot read faster than you write, and that is before accounting for message bursts and failures/downtime of consumers. So you need it to be significantly lower than 1; how much lower depends on the latency your system can tolerate.
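A minimal sketch of that check, plugging in the question's 5k messages/sec, an assumed 100 ms processing time, and an assumed partition count of 650; the 0.5 threshold is just one possible reading of "significantly lower than 1", chosen for illustration.

```python
# Utilization check from the formula above. The 100 ms processing time and
# the 650-partition count are assumptions; 0.5 is just one possible target
# for "significantly lower than 1", chosen for illustration.
msg_rate = 5_000            # messages per second (from the question)
processing_time = 0.100     # seconds per message (assumed)
num_partitions = 650        # assumed

utilization = (processing_time * msg_rate) / num_partitions
print(f"utilization = {utilization:.2f}")    # ~0.77 with these numbers

if utilization >= 1:
    print("consumers cannot keep up: add partitions (and consumers)")
elif utilization > 0.5:
    print("keeping up, but little headroom for bursts or consumer downtime")
else:
    print("comfortable headroom")
```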

H. Opler answered Sep 25 '22