Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

etcd cluster id mistmatch

Hey I have a cluster id mismatch for some reason, i had it on 1 node then disapperead after clearing data dir few times , changing cluster token and node names, but apperead on another

here is the script i use

IP0=10.150.0.1
IP1=10.150.0.2
IP2=10.150.0.3
IP3=10.150.0.4
NODENAME0=node0
NODENAME1=node1
NODENAME2=node2
NODENAME3=node3

# changing these on each box
THISIP=$IP2
THISNODENAME=$NODENAME2

etcd --name $THISNODENAME --initial-advertise-peer-urls http://$THISIP:2380 \
 --data-dir /root/etcd-data \
 --listen-peer-urls http://$THISIP:2380 \
 --listen-client-urls http://$THISIP:2379,http://127.0.0.1:2379 \
 --advertise-client-urls http://$THISIP:2379 \
 --initial-cluster-token etcd-cluster-2 \
 --initial-cluster $NODENAME0=http://$IP0:2380,$NODENAME1=http://$IP1:2380,$NODENAME2=http://$IP2:2380,$NODENAME3=http://$IP3:2380 \
 --initial-cluster-state new

I get

2016-11-11 22:13:12.090515 I | etcdmain: etcd Version: 2.3.7   
2016-11-11 22:13:12.090643 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2016-11-11 22:13:12.090713 I | etcdmain: listening for peers on http://10.150.0.3:2380
2016-11-11 22:13:12.090745 I | etcdmain: listening for client requests on http://10.150.0.3:2379
2016-11-11 22:13:12.090771 I | etcdmain: listening for client requests on http://127.0.0.1:2379
2016-11-11 22:13:12.090960 I | etcdserver: name = node2
2016-11-11 22:13:12.090976 I | etcdserver: data dir = /root/etcd-data
2016-11-11 22:13:12.090983 I | etcdserver: member dir = /root/etcd-data/member
2016-11-11 22:13:12.090990 I | etcdserver: heartbeat = 100ms
2016-11-11 22:13:12.090995 I | etcdserver: election = 1000ms
2016-11-11 22:13:12.091001 I | etcdserver: snapshot count = 10000
2016-11-11 22:13:12.091011 I | etcdserver: advertise client URLs = http://10.150.0.3:2379
2016-11-11 22:13:12.091269 I | etcdserver: restarting member 7fbd572038b372f6 in cluster 4e73d7b9b94fe83b at commit index 4
2016-11-11 22:13:12.091317 I | raft: 7fbd572038b372f6 became follower at term 8
2016-11-11 22:13:12.091346 I | raft: newRaft 7fbd572038b372f6 [peers: [], term: 8, commit: 4, applied: 0, lastindex: 4, lastterm: 1]
2016-11-11 22:13:12.091516 I | etcdserver: starting server... [version: 2.3.7, cluster version: to_be_decided]
2016-11-11 22:13:12.091869 E | etcdmain: failed to notify systemd for readiness: No socket
2016-11-11 22:13:12.091894 E | etcdmain: forgot to set Type=notify in systemd service file?
2016-11-11 22:13:12.096380 N | etcdserver: added member 7508b3e625cfed5 [http://10.150.0.4:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.099800 N | etcdserver: added member 14c76eb5d27acbc5 [http://10.150.0.1:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.100957 N | etcdserver: added local member 7fbd572038b372f6 [http://10.150.0.2:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.102711 N | etcdserver: added member d416fca114f17871 [http://10.150.0.3:2380] to cluster 4e73d7b9b94fe83b
2016-11-11 22:13:12.134330 E | rafthttp: request cluster ID mismatch (got cfd5ef74b3dcf6fe want 4e73d7b9b94fe83b)

the other memebers are not even running, how that's possible ?

Thank you

like image 494
davidear Avatar asked Nov 14 '16 09:11

davidear


People also ask

What happens when etcd loses quorum?

If you lose etcd quorum, you can restore it. If you run etcd on a separate host, you must back up etcd, take down your etcd cluster, and form a new one. You can use one healthy etcd node to form a new cluster, but you must remove all other healthy nodes.

Why does etcd lose its leader from disk latency spikes?

Why does etcd lose its leader from disk latency spikes? This is intentional; disk latency is part of leader liveness. Suppose the cluster leader takes a minute to fsync a raft log update to disk, but the etcd cluster has a one second election timeout.

How do I find my etcd leader?

Run the etcdctl command to find out the ETCD Leader.

Does etcd use raft?

To ensure data consistency across these nodes, etcd uses a popular consensus algorithm called Raft. In Raft, one node is elected as leader, while the remaining nodes are designated as followers. The leader is responsible for maintaining the current state of the system and ensuring that the followers are up-to-date.


1 Answers

For all those who stumble upon this from google:

The error is about peer member ID, that tries to join cluster with same name as another member (probably old instance) that already exists in cluster (with same peer name, but another ID, this is the problem).

you should delete the peer and re-add it like shown in this helpful post:

In order to fix this it was pretty simple, first we had to log into an existing working server on the rest of the cluster and remove server00 from its member list:

etcdctl member remove <UID>

This free's up the ability to allow the new server00 to join but we needed to simply tell the cluster it could by issuing the add command:

etcdctl member add server00 http://1.2.3.4:2380

It you follow the logs on server00 you'll then see that everything spring into life. You can confirm this with the commands:

etcdctl member list

etcdctl cluster-health

Use "etcdctl member list" to find what are the IDs of current members, and find the one which tries to join cluster with wrong ID, then delete that peer from "members" with "etcdctl member remove " and try to rejoin him. Hope it helps.

like image 173
Dmitry Shmakov Avatar answered Sep 20 '22 17:09

Dmitry Shmakov