
How does kafka handle network partitions?

Kafka has the concept of an in-sync replica (ISR) set, which is the set of nodes that aren't too far behind the leader.

What happens if the network cleanly partitions so that a minority containing the leader is on one side, and a majority containing the other in-sync nodes is on the other side?

The minority/leader-side presumably thinks that it lost a bunch of nodes, reduces the ISR size accordingly, and happily carries on.

The other side probably thinks that it lost the leader, so it elects a new one and happily carries on.

Now we have two leaders in the same cluster, accepting writes independently. In a system that requires a majority of nodes to proceed after a partition, the old leader would step down and stop accepting writes.

What happens in this situation in Kafka? Does it require a majority vote to change the ISR set? If so, is there brief data loss until the leader side detects the outages?

asked Feb 16 '18 by Filip Haglund


People also ask

How are partitions distributed in Kafka?

Partitions are the way that Kafka provides scalability. A Kafka cluster is made up of one or more servers, called brokers in the Kafka universe. Each broker holds a subset of the records that belong to the entire cluster. Kafka distributes the partitions of a particular topic across multiple brokers.

How many partitions can Kafka handle?

As a rule of thumb, the recommendation is for each broker to have up to 4,000 partitions and each cluster to have up to 200,000 partitions.

How does Kafka consumer read from multiple partitions?

The consumers in a group divide the topic's partitions among themselves as fairly as possible, ensuring that each partition is consumed by only a single consumer from the group. When there are fewer consumers than partitions, some consumers will read messages from more than one partition.
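As a minimal illustration (not part of the original snippet; the topic "events", group id "example-group", and broker address are placeholders), a Java consumer that joins a group and may be assigned several partitions could look like this:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            // All consumers sharing this group.id split the topic's partitions among themselves.
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "example-group");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events")); // placeholder topic name
                while (true) {
                    // Records may come from several partitions if this consumer was assigned more than one.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }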


2 Answers

I haven't tested this, but I think the accepted answer is wrong and Lars Francke is correct about the possibility of split-brain.

A Zookeeper quorum requires a majority, so if the ZK ensemble partitions, at most one side will have a quorum.

Being a controller requires having an active session with ZK (ephemeral znode registration). If the current controller is partitioned away from ZK quorum, it should voluntarily stop considering itself a controller. This should take at most zookeeper.session.timeout.ms = 6000. Brokers still connected to ZK quorum should elect a new controller among themselves. (based on this: https://stackoverflow.com/a/52426734)

Being a topic-partition leader also requires an active session with ZK. A leader that has lost its connection to the ZK quorum should voluntarily stop being one. The elected controller will detect that some ex-leaders are missing and will assign new leaders from the replicas that are in the ISR and still connected to the ZK quorum.
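To make this concrete (not something from the answer itself), the current leader and ISR of each partition can be inspected from a client with the Java AdminClient; the topic name "events" and the broker address are placeholders. After the controller reassigns leadership, the same call would report the new leader and a shrunken ISR:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class ShowLeadersAndIsr {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            try (AdminClient admin = AdminClient.create(props)) {
                Map<String, TopicDescription> topics =
                        admin.describeTopics(Collections.singleton("events")).all().get();
                for (TopicPartitionInfo p : topics.get("events").partitions()) {
                    // leader() is the broker currently serving writes; isr() lists the replicas considered in sync
                    System.out.printf("partition=%d leader=%s isr=%s%n",
                            p.partition(), p.leader(), p.isr());
                }
            }
        }
    }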

Now, what happens to producer requests received by the partitioned ex-leader during ZK timeout window? There are some possibilities.

If the producer's acks = all and the topic's min.insync.replicas = replication.factor, then all ISR members should have exactly the same data. The ex-leader will eventually reject in-progress writes and producers will retry them. The newly elected leader will not have lost any data. On the other hand, it won't be able to serve any write requests until the partition heals. It will be up to producers to decide whether to reject client requests or keep retrying in the background for a while.

Otherwise, it is very probable that the new leader will be missing up to zookeeper.session.timeout.ms + replica.lag.time.max.ms = 16000 ms worth of records, and they will be truncated from the ex-leader after the partition heals.
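As a rough sketch of the first, no-data-loss case (broker addresses and the topic name "events" are placeholders, not from the answer): with acks=all the broker acknowledges a write only after all in-sync replicas have it, and enabling idempotence keeps the retries from producing duplicates:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class AllAcksProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            // Wait for acknowledgement from all in-sync replicas before a write counts as successful.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Retries are safe (no duplicates) with idempotence enabled.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
            // How long the producer keeps retrying in the background before giving up.
            props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value"), (metadata, exception) -> {
                    if (exception != null) {
                        // During a partition the send may fail or time out; the application decides
                        // whether to surface the error to its clients or keep retrying.
                        System.err.println("Send failed: " + exception);
                    }
                });
            }
        }
    }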

Let's say you expect network partitions that last longer than you are comfortable being read-only for.

Something like this can work:

  • you have 3 availability zones and expect that at most 1 zone will be partitioned from the other 2
  • in each zone you have a Zookeeper node (or a few), so that 2 zones combined can always form a majority
  • in each zone you have a bunch of Kafka brokers
  • each topic has replication.factor = 3, one replica in each availability zone, min.insync.replicas = 2
  • producers' acks = all

This way there should be two Kafka ISRs on the ZK-quorum side of the network partition, at least one of them fully up to date with the ex-leader. So there is no data loss on the brokers, and the topic stays available for writes from any producers that are still able to connect to the winning side.
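A sketch of creating such a topic with the Java AdminClient, assuming the brokers already advertise their zone via broker.rack so that replica assignment spreads the three replicas across zones; the topic name, partition count, and bootstrap address are made-up placeholders:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateZoneAwareTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            try (AdminClient admin = AdminClient.create(props)) {
                // One replica per availability zone; writes need 2 of the 3 replicas to be in sync.
                NewTopic topic = new NewTopic("events", 6, (short) 3)
                        .configs(Map.of("min.insync.replicas", "2"));
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }

With min.insync.replicas = 2 and acks = all, a write needs replicas in two zones, so a single isolated zone cannot acknowledge writes on its own.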

answered Oct 31 '22 by Alexander Abramov


In a Kafka cluster, one of the brokers is elected to serve as the controller.

Among other things, the controller is responsible for electing new leaders. The Replica Management section covers this briefly: http://kafka.apache.org/documentation/#design_replicamanagment

Kafka uses Zookeeper to try to ensure there's only 1 controller at a time. However, the situation you described could still happen, splitting both the Zookeeper ensemble (assuming both sides can still have a quorum) and the Kafka cluster in 2, resulting in 2 controllers.

In that case, Kafka has a number of configurations to limit the impact:

  • unclean.leader.election.enable: False by default, this is used to prevent replicas that were not in-sync from ever becoming leaders. If no available replicas are in-sync, Kafka marks the partition as offline, preventing data loss
  • replication.factor and min.insync.replicas: For example, if you set them to 3 and 2 respectively, then in case of a "split-brain" you can prevent producers from sending records to the minority side, provided they use acks=all (see the sketch below)
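As a hedged sketch (the topic name "events" and broker address are placeholders), both settings can be applied to an existing topic with the Java AdminClient:

    import java.util.Collection;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    public class TightenTopicDurability {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder address
            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
                // Keep out-of-sync replicas from ever being elected leader, and require 2 in-sync
                // replicas so that acks=all producers are fenced off from an isolated minority.
                Collection<AlterConfigOp> ops = List.of(
                        new AlterConfigOp(new ConfigEntry("unclean.leader.election.enable", "false"),
                                AlterConfigOp.OpType.SET),
                        new AlterConfigOp(new ConfigEntry("min.insync.replicas", "2"),
                                AlterConfigOp.OpType.SET));
                admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
            }
        }
    }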

See also KIP-101 for the details about handling logs that have diverged once the cluster is back together.

answered Oct 31 '22 by Mickael Maison