I have three servers in my quorum, running ZooKeeper 3.4.5. Two of them appear to be running fine based on the output from mntr. The third was restarted a couple of days ago as part of a deploy, and since then it has not been able to rejoin the quorum. Some lines in its logs that stick out are:
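(mntr here is ZooKeeper's four-letter-word monitoring command, run against the client port; assuming the default port 2181:

echo mntr | nc localhost 2181
echo ruok | nc localhost 2181
)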
2014-03-03 18:44:40,995 [myid:1] - INFO [main:QuorumPeer@429] - currentEpoch not found! Creating with a reasonable default of 0. This should only happen when you are upgrading your installation
and:
2014-03-03 18:44:41,233 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (2, 1)
2014-03-03 18:44:41,234 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:QuorumCnxManager@190] - Have smaller server identifier, so dropping the connection: (3, 1)
2014-03-03 18:44:41,235 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:FastLeaderElection@774] - Notification time out: 400
Googling for the first ('currentEpoch not found!') led me to JIRA ZOOKEEPER-1653 - zookeeper fails to start because of inconsistent epoch. It describes a bug fix but doesn't describe a way to resolve the issue without upgrading ZooKeeper.
Googling for the second ('Have smaller server identifier, so dropping the connection') led me to JIRA ZOOKEEPER-1506 - Re-try DNS hostname -> IP resolution if node connection fails. This makes sense because I am using AWS Elastic IPs for the servers. The fix for this issue seems to be to do a rolling restart, which would cause us to temporarily lose quorum.
It looks like the second issue is definitely in play because I see timeouts in the other ZooKeeper servers' logs (the ones still in the quorum) when they try to connect to the first server. What I'm not sure of is whether the first issue will disappear when I do a rolling restart. I would like to avoid upgrading and/or doing a rolling restart, but if I have to do a rolling restart I'd like to avoid doing it multiple times. Is there a way to fix the first issue without upgrading? Or even better: is there a way to resolve both issues without doing a rolling restart?
Thanks for reading and for your help!
A quorum is basically the minimum number of server nodes that must be up, running, and available for client requests. Any update made to the ZooKeeper tree by clients must be persistently stored on this quorum of nodes for the transaction to complete successfully. For a three-node ensemble, two nodes form a quorum, so the ensemble can tolerate one failure.
Create a file named myid under the ZooKeeper data directory on each ZooKeeper server. This file should contain only that server's number X. In zoo.cfg, every member of the ensemble is listed as server.X=server_name:port1:port2, where server_name is the hostname of the node where the ZooKeeper service is started, port1 is the port followers use to connect to the leader, and port2 is the port used for leader election.
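A minimal sketch, assuming a three-node ensemble with hypothetical hostnames zk1/zk2/zk3, the default ports, and /var/lib/zookeeper as the data directory:

# zoo.cfg (identical on every node)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888

# myid: on the node running as server.1 (use 2 and 3 on the other nodes)
echo 1 > /var/lib/zookeeper/myid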
If one of the ZooKeeper nodes fails, the following occurs: the other ZooKeeper nodes detect the failure to respond, and a new leader is elected if the failed node was the current leader. If multiple nodes fail and ZooKeeper loses its quorum, the remaining nodes reject requests for changes until a majority is available again (they only continue serving reads if read-only mode has been explicitly enabled).
The ZooKeeper service can be stopped and started with systemctl; to restart the daemon in one step: sudo systemctl restart zk (the unit may be named zookeeper rather than zk, depending on how it was installed).
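If a rolling restart does become necessary, a sketch of doing it one node at a time (the hostnames and the zk unit name are assumptions; waiting for each node to rejoin before touching the next one minimizes the window without quorum, and the leader is commonly left for last):

for host in zk1 zk2 zk3; do
  ssh "$host" 'sudo systemctl restart zk'
  # wait until the node reports itself as leader or follower before moving on
  until echo srvr | nc "$host" 2181 | grep -qE 'Mode: (leader|follower)'; do sleep 2; done
done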
This is a known ZooKeeper bug ('Server is unable to join quorum after connection broken to other peers'). Restarting the leader resolves the issue.
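To find out which node is currently the leader before restarting it (hostnames are placeholders; srvr is the standard four-letter-word command):

for host in zk1 zk2 zk3; do
  echo -n "$host: "
  echo srvr | nc "$host" 2181 | grep Mode
done
# then restart only the node that reports "Mode: leader"
ssh <leader-host> 'sudo systemctl restart zk'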