I have three servers in my quorum, running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. The third was restarted a couple of days ago as part of a deploy, and since then it has not been able to rejoin the quorum. Some lines in its logs that stick out are:
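(mntr here is ZooKeeper's four-letter-word monitoring command, run against the client port; assuming the default port 2181:

echo mntr | nc localhost 2181
echo ruok | nc localhost 2181
)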
2014-03-03 18:44:40,995 [myid:1] - INFO [main:QuorumPeer@429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
and:
2014-03-03 18:44:41,233 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@774] - Notification time out: 400
Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading ZooKeeper.
Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.
It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper servers' logs (the ones still in the quorum) when they try to connect to the first server. What I'm not sure of is whether the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: is there a way to resolve both issues without doing a rolling restart?
Thanks for reading and for your help!
A quorum is basically the minimum number of server nodes that must be up, running, and available for client requests. Any update made to the ZooKeeper tree by clients must be persistently stored on this quorum of nodes for the transaction to complete successfully. For a three-node ensemble, two nodes form a quorum, so the ensemble can tolerate one failure.
Create a file named myid under the ZooKeeper data directory on each ZooKeeper server. This file should contain only that server's number X. In zoo.cfg, every member of the ensemble is listed as server.X=server_name:port1:port2, where server_name is the hostname of the node where the ZooKeeper service is started, port1 is the port followers use to connect to the leader, and port2 is the port used for leader election.
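A minimal sketch, assuming a three-node ensemble with hypothetical hostnames zk1/zk2/zk3, the default ports, and /var/lib/zookeeper as the data directory:

# zoo.cfg (identical on every node)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

# myid: on the node running as server.1 (use 2 and 3 on the other nodes)
echo 1 > /var/lib/zookeeper/myid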
If one of the ZooKeeper nodes fails, the following occurs: the other ZooKeeper nodes detect the failure to respond, and a new leader is elected if the failed node was the current leader. If multiple nodes fail and ZooKeeper loses its quorum, the remaining nodes reject requests for changes until a majority is available again (they only continue serving reads if read-only mode has been explicitly enabled).
The ZooKeeper service can be stopped and started with systemctl; to restart the daemon in one step: sudo systemctl restart zk (the unit may be named zookeeper rather than zk, depending on how it was installed).
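If a rolling restart does become necessary, a sketch of doing it one node at a time (the hostnames and the zk unit name are assumptions; waiting for each node to rejoin before touching the next one minimizes the window without quorum, and the leader is commonly left for last):

for host in zk1 zk2 zk3; do
  ssh "$host" 'sudo systemctl restart zk'
  # wait until the node reports itself as leader or follower before moving on
  until echo srvr | nc "$host" 2181 | grep -qE 'Mode: (leader|follower)'; do sleep 2; done
done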
This is a known ZooKeeper bug ('Server is unable to join quorum after connection broken to other peers'). Restarting the leader resolves the issue.
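To find out which node is currently the leader before restarting it (hostnames are placeholders; srvr is the standard four-letter-word command):

for host in zk1 zk2 zk3; do
  echo -n "$host: "
  echo srvr | nc "$host" 2181 | grep Mode
done
# then restart only the node that reports "Mode: leader"
ssh <leader-host> 'sudo systemctl restart zk'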