I have a partitioned topic with X partitions. As of now, when producing messages, I create Kafka's ProducerRecord specifying only the topic and the value; I do not define a key.
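For reference, here is a minimal sketch of such a keyless producer (the broker address and topic name are assumed placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeylessProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Only topic and value are set; the key is null, so the default
            // partitioner spreads records across partitions.
            producer.send(new ProducerRecord<>("my-topic", "some payload"));
        }
    }
}
```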
As far as I understand, my messages will be distributed evenly across the partitions by the default built-in partitioner.
On the other hand, I have a thread pool of Kafka consumers. Each consumer runs in its own dedicated thread, consuming messages from the topic, and each of those consumers is given the same group.id. This allows messages to be consumed in parallel: every consumer is assigned its fair share of partitions to read from.
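A minimal sketch of that consumer pool might look like this (pool size, broker address, topic, and group name are assumed placeholders; each thread gets its own KafkaConsumer instance, since the consumer itself is not thread-safe):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConsumerGroupPool {
    public static void main(String[] args) {
        int threads = 4; // assumed pool size; more consumers than partitions gains nothing
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092"); // assumed
                props.put("group.id", "my-consumer-group");       // same group.id for all threads
                props.put("key.deserializer", StringDeserializer.class.getName());
                props.put("value.deserializer", StringDeserializer.class.getName());

                try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                    consumer.subscribe(Collections.singletonList("my-topic"));
                    while (!Thread.currentThread().isInterrupted()) {
                        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                        for (ConsumerRecord<String, String> r : records) {
                            System.out.printf("partition=%d offset=%d value=%s%n",
                                    r.partition(), r.offset(), r.value());
                        }
                    }
                }
            });
        }
    }
}
```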
I want my messages to be consumed in an orderly fashion. I know that Kafka guarantees the order of messages within a partition. So, as long as I come up with a proper key structure, messages that belong together will end up in the same partition. In effect, the message key groups related messages and stores them in the same partition.
Does it make sense?
Q: Is there a chance that, due to a badly designed key, I will get uneven partitions, with one receiving far more records than the others? Can that hurt the performance of my Kafka cluster? What are the best practices for message key design?
Usually, the key of a Kafka message is used to select the partition: the partitioner hashes the key and returns the target partition number as an int. Without a key, you would have to derive the partition from the value, which can be much more complex to process.
Kafka uses the abstraction of a distributed log that consists of partitions. Splitting a log into partitions allows the system to scale out. The key is used to determine the partition within the log to which a message gets appended, while the value is the actual payload of the message.
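To make the key-to-partition mapping concrete, here is a minimal Partitioner sketch that mirrors the default key-based logic; it is an illustration, not Kafka's actual implementation, which handles keyless records with round-robin or sticky behaviour instead of the fallback shown here:

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

import java.util.Map;

// Sketch: hash the serialized key and take it modulo the partition count,
// which is what the default key-based partitioning boils down to.
public class KeyHashPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            // Without a key there is nothing to hash; a real implementation
            // would fall back to round-robin / sticky assignment here.
            return 0;
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```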
As a few general rules of thumb for sizing:
- at most about 4,000 partitions per broker (in total, distributed over many topics)
- at most about 200,000 partitions per Kafka cluster (in total, distributed over many topics)
resulting in a maximum of about 50 brokers per Kafka cluster.
Your understanding of the default partitioner is correct.
When you don't have a requirement to consume some messages in the same order as they were produced, then not specifying a key is the best option. If that is not your case, your requirement tells you what the key must be: for instance, if you want to preserve the order of messages produced for a given user, a user_id is a natural message key.
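A minimal sketch of such a keyed producer (the topic name, broker address, and user ids are made-up examples):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Every event for user "42" shares the key, so all of them hash to
            // the same partition and keep their produced order.
            producer.send(new ProducerRecord<>("user-events", "42", "logged-in"));
            producer.send(new ProducerRecord<>("user-events", "42", "added-to-cart"));
        }
    }
}
```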
To achieve a particular message order you also need to think about how your producers are configured. If your producers can retry sending a message in case of failure and max.in.flight.requests.per.connection is greater than 1, then messages can be produced out of order (unless idempotence is enabled).
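A sketch of producer settings that preserve ordering under retries; enabling idempotence is the modern approach, while capping in-flight requests at 1 is the older alternative:

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class OrderPreservingProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed
        // With idempotence enabled, ordering is preserved even with retries
        // and up to 5 in-flight requests per connection.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE));
        // Older alternative without idempotence: allow retries but only one
        // in-flight request per connection, at the cost of throughput.
        // props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, "1");
        return props;
    }
}
```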
You can get uneven partitions by choosing a bad key. For example, if 90% of your users are from New York and 10% are from other cities, and you choose the city as the key, then one of your partitions will be huge and one of your consumers overloaded (assuming the number of messages per user is the same).
Kafka applies a murmur hash to the key and takes it modulo the number of partitions, i.e. murmur2(serializedKey) % numPartitions. With the default partitioner, the result should in all likelihood be evenly distributed. I would suggest you experiment with all your key options using a simple murmur2 function written in Java to see the distribution pattern before making a choice. Also note that there are two implementations of default partitioning in Kafka: the murmur hash implementation is in the newer versions, while old legacy versions work differently.
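A quick way to run that experiment with Kafka's own hash (Utils.murmur2 is the internal utility the default partitioner uses; the partition count and sample keys below are assumptions you would replace with your own candidates):

```java
import org.apache.kafka.common.utils.Utils;

import java.nio.charset.StandardCharsets;
import java.util.stream.IntStream;

// Feed candidate keys through the same hash-and-modulo logic the default
// partitioner uses and count how many records land on each partition.
public class KeyDistributionCheck {
    public static void main(String[] args) {
        int numPartitions = 12; // assumed partition count
        int[] counts = new int[numPartitions];

        // Replace with a realistic sample of your candidate keys.
        IntStream.range(0, 100_000)
                .mapToObj(i -> "user-" + i)
                .forEach(key -> {
                    byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
                    int partition = Utils.toPositive(Utils.murmur2(bytes)) % numPartitions;
                    counts[partition]++;
                });

        for (int p = 0; p < numPartitions; p++) {
            System.out.printf("partition %2d: %d records%n", p, counts[p]);
        }
    }
}
```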