I'm having a recurrent issue with Kafka: I partition messages by customer id, and sometimes it happens that a customer gets a huge amount of messages. As a result, the messages of this customer and all other customers in the same partition get delayed.
Are there well-known ways to handle this issue? Possibly with other messaging platforms?
Ideally, only the messages of that one customer would be delayed; every other customer's messages would get an equal share of the consumers' bandwidth.
Note: I must partition by customer id, because I want to consume the messages of any given customer in order. However, I can consume the messages of two different customers in any order.
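For illustration, a minimal sketch of producing messages keyed by customer id with the kafka-python client (the topic name, broker address, and customer ids are made up):

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def send_event(customer_id: str, payload: bytes) -> None:
    # Using the customer id as the message key sends all of that customer's
    # messages to the same partition, which preserves per-customer ordering
    # -- and is also what causes the hot-partition problem described above.
    producer.send("customer-events", key=customer_id.encode("utf-8"), value=payload)

send_event("customer-42", b'{"action": "order_created"}')
producer.flush()
```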
Often the data will stay there and get deleted once a specified retention period or a maximum size/data limit has been reached. As for your other question about the reasoning for having more partitions, it simply comes down to scaling.
Kafka can distribute messages by a key associated with each message: if the key is the same for several messages, all of them are put in the same partition. Message brokers may also offer different delivery guarantees, such as "at most once", "at least once", or "exactly once".
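A simplified sketch of that key-to-partition mapping (Kafka's default partitioner actually uses a murmur2 hash of the key bytes, but the "hash modulo partition count" idea is the same; the partition count here is an assumption):

```python
import zlib

NUM_PARTITIONS = 12  # assumed partition count for the topic

def partition_for(key: bytes) -> int:
    # Same key -> same hash -> same partition, every time.
    return zlib.crc32(key) % NUM_PARTITIONS

print(partition_for(b"customer-42"))  # always the same partition
print(partition_for(b"customer-42"))
print(partition_for(b"customer-77"))  # may land on a different partition
```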
But here are a few general rules: a maximum of 4,000 partitions per broker (in total, distributed over many topics); a maximum of 200,000 partitions per Kafka cluster (in total, distributed over many topics); resulting in a maximum of 50 brokers per Kafka cluster.
If there is only one broker, both partitions are stored on that same broker. Whenever the broker count is lower than the partition count, several partitions of the same topic end up on the same broker. Apache Kafka is a distributed system.
A Kafka message is a small or medium-sized piece of data. To Kafka, a message is nothing but a simple array of bytes; it is just information or data coming from different sources, and those sources are not specific to any platform or software.
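To illustrate the "just bytes" point, a small sketch (the payload fields are made up) of turning an arbitrary payload into the byte arrays Kafka actually stores:

```python
import json

# Any payload can be sent, as long as it is serialized to bytes first.
event = {"customer_id": "customer-42", "action": "order_created", "amount": 42.5}

value_bytes = json.dumps(event).encode("utf-8")   # what Kafka stores as the value
key_bytes = event["customer_id"].encode("utf-8")  # what the partitioner hashes
```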
Adding partitions in Kafka introduces latency spikes while rebalancing occurs, so we tend to size the partitions according to the peak loads and the scaling-out needs of the consumers. But if we do need to increase the number of partitions and consumers for scaling purposes, then we only pay a momentary latency cost while the rebalance occurs.
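If you do have to grow a topic later, a sketch of doing it programmatically with kafka-python's admin client (the topic name and target count are assumptions):

```python
from kafka.admin import KafkaAdminClient, NewPartitions

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Grow the hypothetical "customer-events" topic to 24 partitions in total.
# Note that new messages for an existing key may now hash to a different
# partition, so per-key ordering is only guaranteed going forward --
# another reason to get the partition count right up front.
admin.create_partitions({"customer-events": NewPartitions(total_count=24)})
admin.close()
```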
LZ4 is the weakest candidate in terms of compression ratio. Based on our own test results, enabling compression when sending messages with Kafka can provide great benefits in terms of disk space utilization and network usage, at the cost of only slightly higher CPU utilization and increased dispatch latency.
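Enabling compression is a producer-side setting; with kafka-python, for example, it is a single constructor argument (broker address assumed):

```python
from kafka import KafkaProducer

# Batches are compressed with LZ4 before being sent; the broker stores them
# compressed, trading a little CPU for less network and disk usage.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    compression_type="lz4",
)
```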
By using a hashing function to route messages to partitions, Kafka gives us data locality. For example, messages related to user id 1001 always go to consumer 3. Because user 1001's events always go to consumer 3, that consumer can efficiently perform operations that would not be feasible if network round-trips were needed.
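A sketch of what that locality buys on the consumer side (kafka-python; the topic and group names are assumptions): per-customer state can live in plain process memory, because the same customer's events always arrive at the same consumer instance.

```python
from collections import defaultdict
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "customer-events",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
)

# Purely local, in-memory state. It stays correct without any distributed
# lookups because a given key always maps to the same partition, and each
# partition is consumed by exactly one member of the group.
events_per_customer = defaultdict(int)

for record in consumer:
    customer_id = record.key.decode("utf-8")
    events_per_customer[customer_id] += 1
```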
I will try to answer based on the limited information provided.
Kafka partitions are the smallest unit of scalability, so, for example, if you have 10 parallel consumers (Kafka topic listeners) you should partition your topic into this number or higher; otherwise, some of your listeners will be starved, because Kafka manages consumers in such a way that only one consumer in a group receives messages from a given partition. This protects the partition from getting its message order mixed up. The other direction is supported: a consumer can handle more than one partition at a time.
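For illustration, a sketch of that one-partition-per-consumer rule (kafka-python; topic, group, and counts are assumptions). Run several copies of this script and Kafka spreads the partitions across the running instances, but never gives the same partition to two members of the group:

```python
from kafka import KafkaConsumer

# If the topic has 10 partitions and you start 10 copies of this process,
# each gets one partition; start 12 copies and 2 of them sit idle;
# start 5 copies and each handles 2 partitions.
consumer = KafkaConsumer(
    "customer-events",
    bootstrap_servers="localhost:9092",
    group_id="customer-events-workers",
)

for record in consumer:
    print(f"partition={record.partition} offset={record.offset} key={record.key}")
```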
My design suggestion would be to decide how much capacity you are planning to allocate for the consumer (microservice) instances. This number will guide you to the right number of partitions.
I would avoid using a dynamic number of partitions, as this does not scale well. Use a number that matches the capacity you plan to allocate, plus some spare in case you need to scale up in the future. If tomorrow you get 5 new customers, adding partitions is not easy or wise.
Kafka will make sure the messages stay in order per partition, so this comes for free for your use case. What you need is for the consumer end to be able to handle the messages of the different customer IDs in the right order. To avoid messages of the same customer getting out of order, your partition key can be a higher-level category of customers; I can think of customer type/region/size... The idea is that all of a single customer's messages stay in the same partition.
Your partition key must also relate to the size of the messages/data, so that your messages spread evenly over your Kafka cluster. This helps with the Kafka cluster's own scaling and redundancy.
Deciding on the right partitioning strategy is hard, but it is worth the time spent planning it.
One design solution that comes up a lot is hashing: map the customer ID to a partition key using a hash. Again, decide on a fixed number of partitions and let the hash map the customer ID to your partition key.
Say X customers have a lot of messages and you need one topic per such customer; in this case you map one customer per topic, so your modulo will be the number of these customers.
Y customers are low-traffic customers; for these, use a different modulo, for example Y/5, so that 5 customers share a topic.
Make sure you add the X partition count as an offset to the Y partition numbers so the two groups don't overlap.
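A sketch of that two-tier scheme (all names, counts, and the list of heavy customers are assumptions; for brevity the tiers are modeled as partition ranges within a single topic rather than separate topics, with the partition chosen explicitly instead of via key hashing):

```python
import zlib
from kafka import KafkaProducer

# Hypothetical setup: 3 known heavy customers get a dedicated partition each
# (partitions 0..2); everyone else shares the next 5 partitions (3..7).
HEAVY_CUSTOMERS = {"cust-big-1": 0, "cust-big-2": 1, "cust-big-3": 2}
HEAVY_COUNT = len(HEAVY_CUSTOMERS)   # the "X" offset
SHARED_PARTITIONS = 5                # the "Y / 5" group

def partition_for(customer_id: str) -> int:
    if customer_id in HEAVY_CUSTOMERS:
        return HEAVY_CUSTOMERS[customer_id]
    # Low-traffic customers hash into the shared range, offset by the number
    # of dedicated partitions so the two groups never overlap.
    return HEAVY_COUNT + (zlib.crc32(customer_id.encode()) % SHARED_PARTITIONS)

producer = KafkaProducer(bootstrap_servers="localhost:9092")

def send_event(customer_id: str, payload: bytes) -> None:
    # Bypass the default partitioner and choose the partition explicitly;
    # a single customer still always lands on one partition, so per-customer
    # ordering is preserved.
    producer.send(
        "customer-events",
        key=customer_id.encode(),
        value=payload,
        partition=partition_for(customer_id),
    )
```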
The only issue I see is that this is not flexible: you cannot change the mapping if the number of customers changes. You might allow more topics in each group to support future partitions.