we have setup a Kafka/Zookeeper Cluster consisting of 3 Brokers. We have one producer, sending messages to one specific Kafka topic and a few consumer groups reading from said topic. Those consumers perform a leader election via Zookeeper for themselves (independent from Kafka).
The versions used are:
All processes are managed by Supervisor. So far, everything works just fine. What we tried now (for testing purposes) was to simply kill off all Zookeeper processes and see what happens.
As we expected, our consumer processes couldn't connect to Zookeeper anymore. But unexpectedly, the Kafka Brokers still worked. Our producer didn't complain at all and was still able to write into the topic. While I couldn't use kafka/bin/kafka-topics.sh or similar, since they all require a zookeeper-parameter, I could still see the actual size of the topic-log grow. After restarting the zookeeper processes, everything again worked just like before.
What we couldn't figure out is now... what actually happened there? We thought, Kafka would require a working Zookeeper-Connection and we couldn't find any explanation for this behaviour online.
The ZK server is designed to be "fail fast" meaning that it will shutdown (process exit) if an error occurs that it cannot recover from. As a ZooKeeper serving cluster is highly reliable, this means that while the server may go down the cluster as a whole is still active and serving requests.
In essence, ZooKeeper was a Single Point of Failure (SPoF) in our stack. SPoF terminology usually refers to single machines or servers, but in this case, it was a single distributed service.
However, you can install and run Kafka without Zookeeper. In this case, instead of storing all the metadata inside Zookeeper, all the Kafka configuration data will be stored as a separate partition within Kafka itself.
ZooKeeper is an open source Apache project that provides a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems. The goal is to make these systems easier to manage with improved, more reliable propagation of changes.
When you have one node of zookeeper, broker will not be able to contact zookeeper, after broker discovers zookeeper is not reachable, broker also will become unreachable. Hence the producer and consumer. In case of producer it starts dropping(reject the record). In case of consumer it can happen that, the read record which is not ack'ed may end up processing again when broker is up and ready...
in case of 3node zk one node failure is acceptable as quorum is still satisfied... but cant afford the 2node failures which will lead to the above consequences...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With