Nodes won't join cluster : NotMasterException (Weird master election bug)

Question

I'm setting up an Elasticsearch (5.0.1) cluster.

It has three master-eligible nodes:

el-m01
el-m02
el-m03

The cluster fails to form, and every master node logs the following NotMasterException:

[2016-11-21T15:24:13,274][INFO ][o.e.d.z.ZenDiscovery     ] [el-m01] failed to send join request to master [{el-m02}{bBhsu3fJSj-MyiWJGhQmog}{_IzdeUd4Sv6g-rhemGjEVQ}{192.168.110.118}{192.168.110.118:9300}{rack=r1}], reason [RemoteTransportException[[el-m02][192.168.110.118:9300][internal:discovery/zen/join]]; nested: NotMasterException[Node [{el-m02}{bBhsu3fJSj-MyiWJGhQmog}{_IzdeUd4Sv6g-rhemGjEVQ}{192.168.110.118}{192.168.110.118:9300}{rack=r1}] not master for join request]; ], tried [3] times

Enabling debug logging showed me the following:

The master election happens and succeeds. However, although every node has chosen a master, no node believes that it is the master itself, i.e.:

el-m01 thinks el-m02 is the master
el-m02 thinks el-m03 is the master
el-m03 thinks el-m01 is the master

What is happening here?

A-y · Accepted Answer

Here is the situation: because all the masters were created by cloning a VM, every node has the same node ID.

This can be verified with the following command, which lists all node IDs:

GET /_cat/nodes?v&h=id,ip,name&full_id=true

Note that since your cluster hasn't formed, each node needs to be queried individually. Quote the URL so that the shell does not interpret the & characters, i.e.:

curl '192.168.110.111:9200/_cat/nodes?v&h=id,ip,name&full_id=true'
curl '192.168.110.112:9200/_cat/nodes?v&h=id,ip,name&full_id=true'
(...)
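
As a rough sketch, the per-node check can be scripted so that duplicates stand out. The function names fetch_node_ids and find_duplicate_ids are made up for this example, and it assumes plain HTTP on the default port 9200 with no authentication. It only reuses the _cat/nodes request shown above, without the v flag so that no header line is printed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Elasticsearch.

# Query a single node directly and print its "id ip name" lines.
# $1 = host or IP of the node (assumed to listen on HTTP port 9200).
fetch_node_ids() {
    curl -s "http://$1:9200/_cat/nodes?h=id,ip,name&full_id=true"
}

# Read "id ip name" lines on stdin and print every node ID that is
# reported for more than one distinct IP address.
find_duplicate_ids() {
    awk '
        NF >= 2 && !seen[$1 SUBSEP $2]++ { ips[$1]++ }
        END { for (id in ips) if (ips[id] > 1) print id }
    '
}

# Assumed usage, with the IPs replaced by those of your own nodes:
#   for host in 192.168.110.111 192.168.110.112; do
#       fetch_node_ids "$host"
#   done | find_duplicate_ids
```

Any ID printed by find_duplicate_ids is shared by at least two machines. An empty result means no duplicate was seen among the nodes that answered.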

This is bad: node IDs need to be unique.

To solve this situation, you need to delete the contents of the data directory (/var/lib/elasticsearch) on every node. This deletes all data in Elasticsearch, and it also resets the node IDs.
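
A minimal sketch of that reset, assuming a package install with the data in /var/lib/elasticsearch and a systemd service named elasticsearch (both are assumptions, so adjust them to your setup). The helper name clear_data_dir is hypothetical. It empties the directory but keeps the directory itself, so ownership and permissions stay as they were.

```shell
#!/bin/sh
# Hypothetical helper: remove everything inside a data directory,
# including hidden files, while keeping the directory itself.
clear_data_dir() {
    dir="$1"
    # Refuse to run on an empty argument or a missing directory.
    [ -n "$dir" ] && [ -d "$dir" ] || return 1
    # -mindepth 1 skips the directory itself; -delete removes children first.
    find "$dir" -mindepth 1 -delete
}

# Assumed usage, as root, on every node (this destroys all indexed data):
#   systemctl stop elasticsearch
#   clear_data_dir /var/lib/elasticsearch
#   systemctl start elasticsearch
```

Stopping the node first matters in this sketch: a running node could otherwise keep writing into the directory while it is being emptied.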

To avoid having this problem in the first place, you can:

A. Install Elasticsearch after the VMs have been cloned.
B. Use an automation tool such as Ansible or Puppet to manage Elasticsearch.

PhaedrusTheGreek · Answer

The Elasticsearch data directory ($ES_HOME/data or, for an RPM install, /var/lib/elasticsearch) contains a node ID that is randomly generated when Elasticsearch is first started. If this directory is copied to multiple instances that are expected to form a cluster, the following error should be received:

failed to send join request to master [..] IllegalArgumentException [..] found existing node [..] with the same id but is a different node instance

However, when minimum_master_nodes is not met, an error less indicative of the problem is received:

failed to send join request to master [..] NotMasterException [..] not master for join request

GitHub issue: https://github.com/elastic/elasticsearch/issues/32904

The issue can be resolved by deleting the contents of the data directory, and data directories shouldn't be copied in the first place.
