In our production environment, we often see partitions go under-replicated while consumers are reading from the topics. We are using Kafka 0.11. From the documentation, what I understand is:
The configuration parameter replica.lag.max.messages was removed. Partition leaders will no longer consider the number of lagging messages when deciding which replicas are in sync.
The configuration parameter replica.lag.time.max.ms now refers not just to the time passed since the last fetch request from the replica, but also to the time since the replica last caught up. Replicas that are still fetching messages from leaders but did not catch up to the latest messages within replica.lag.time.max.ms will be considered out of sync.
How do we fix this issue? What are the different reasons for replicas going out of sync? In our scenario, all the Kafka brokers are in a single rack of blade servers and share the same 10 Gbps Ethernet (simplex) network, so I do not see the network as a reason for the replicas to go out of sync.
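As a quick sanity check, the effective replica.lag.time.max.ms on a broker can be read with Kafka's Java AdminClient. This is only a sketch; the bootstrap address broker1:9092 and broker id 0 are placeholders for your environment.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowReplicaLagConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // ask broker id 0 for its effective configuration
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            System.out.println("replica.lag.time.max.ms = "
                    + config.get("replica.lag.time.max.ms").value());
        }
    }
}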
The UnderReplicatedPartitions metric reports the number of partitions whose in-sync replica count has fallen below the desired replication factor.
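As a rough sketch of how that metric can be read, Kafka exposes it per broker over JMX as kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions. The example below assumes the broker JVM was started with JMX enabled (e.g. JMX_PORT=9999); the host and port are placeholders.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UnderReplicatedGauge {
    public static void main(String[] args) throws Exception {
        // placeholder host/port; requires the broker JVM to expose JMX (e.g. JMX_PORT=9999)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbean = connector.getMBeanServerConnection();
            ObjectName gauge = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            // the gauge counts partitions led by this broker whose ISR is smaller than the replica set
            System.out.println("UnderReplicatedPartitions = " + mbean.getAttribute(gauge, "Value"));
        }
    }
}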
Replication is how Kafka provides redundancy: Kafka keeps more than one copy of the same partition across multiple brokers, and each redundant copy is called a replica. If a broker fails, Kafka can still serve consumers from the replicas of the partitions that the failed broker owned.
The replication factor is the number of copies of the data across the brokers. It should always be greater than 1 (typically 2 or 3), so that a replica of the data is stored on another broker and remains accessible if one broker goes down.
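For example, a topic with replication factor 3 can be created with the Java AdminClient. The topic name, partition count, and bootstrap address below are placeholders used only for illustration.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication factor 3: each partition gets copies on three different brokers
            NewTopic topic = new NewTopic("example-topic", 6, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}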
In Kafka, a message stream is defined by a topic, which is divided into one or more partitions. Replication happens at the partition level: each partition has one or more replicas, and the replicas are assigned evenly to different servers (called brokers) in the Kafka cluster. Each replica maintains a log on disk.
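Putting this together, a small sketch with the Java AdminClient can list, for each partition of a topic, the leader, the replica set, and the in-sync replica set (ISR), and flag the partitions whose ISR has shrunk below the replica set. The topic name and bootstrap address are placeholders.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class FindUnderReplicatedPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(Collections.singleton("example-topic"))
                                         .all().get().get("example-topic");
            for (TopicPartitionInfo p : desc.partitions()) {
                // a partition is under-replicated when its ISR is smaller than its replica set
                if (p.isr().size() < p.replicas().size()) {
                    System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr());
                }
            }
        }
    }
}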
We faced the same issue.
The solution was:
No data loss.
The issue was due to a faulty state in ZooKeeper; there was an open issue on ZK for this, but I don't remember the number.
I faced the same issue on Kafka 2.0. After restarting the Kafka controller node, everything caught up on the replicas.
But I am still looking for the reasons why a few partitions are under-replicated while other partitions of the same topic on the same nodes work fine, and I see this issue on random partitions.