
Nodes won't join cluster: NotMasterException (weird master election bug)

I'm setting up an Elasticsearch (5.0.1) cluster.

It has three master-eligible nodes:

el-m01
el-m02
el-m03

The cluster fails to assemble, and every master-eligible node logs the following NotMasterException:

[2016-11-21T15:24:13,274][INFO ][o.e.d.z.ZenDiscovery     ] [el-m01] failed to send join request to master [{el-m02}{bBhsu3fJSj-MyiWJGhQmog}{_IzdeUd4Sv6g-rhemGjEVQ}{192.168.110.118}{192.168.110.118:9300}{rack=r1}], reason [RemoteTransportException[[el-m02][192.168.110.118:9300][internal:discovery/zen/join]]; nested: NotMasterException[Node [{el-m02}{bBhsu3fJSj-MyiWJGhQmog}{_IzdeUd4Sv6g-rhemGjEVQ}{192.168.110.118}{192.168.110.118:9300}{rack=r1}] not master for join request]; ], tried [3] times

Enabling debug logging showed the following:

The master election takes place and completes. However, although every node has chosen a master, no node believes that it is the master itself, i.e.:

  • el-m01 thinks el-m02 is the master
  • el-m02 thinks el-m03 is the master
  • el-m03 thinks el-m01 is the master

What is happening here?

A-y asked Nov 25 '16 18:11


2 Answers

Here is the situation: because all the masters were created by cloning a VM, every node has the same node ID.

This can be verified with the following command, which lists all node IDs:

GET /_cat/nodes?v&h=id,ip,name&full_id=true

Note that since your cluster hasn't formed, each node needs to be queried individually (the URL is quoted so that the shell does not interpret the & characters), e.g.:

curl "192.168.110.111:9200/_cat/nodes?v&h=id,ip,name&full_id=true"
curl "192.168.110.112:9200/_cat/nodes?v&h=id,ip,name&full_id=true"
(...)

This is bad: node IDs need to be unique.
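The per-node check can be scripted. The following is a minimal sketch, assuming the `_cat/nodes` request shown in this answer responds on every node. The function names `collect_node_ids` and `find_duplicate_ids` are hypothetical helpers, not part of Elasticsearch, and the `v` flag is dropped so that no header row has to be skipped.

```shell
# Query each host given as an argument and print its "id ip name" rows.
# Assumes HTTP on port 9200, as in the curl commands above.
collect_node_ids() {
  for host in "$@"; do
    # -s hides the progress meter; the URL is quoted because it contains "&".
    curl -s "http://${host}:9200/_cat/nodes?h=id,ip,name&full_id=true"
  done
}

# Read "id ip name" rows on stdin and print every ID that is
# reported for more than one distinct IP address.
find_duplicate_ids() {
  awk '
    # Count each (id, ip) pair only once, even if several hosts report it.
    { if (!seen[$1 SUBSEP $2]++) count[$1]++ }
    END { for (id in count) if (count[id] > 1) print id }
  ' | sort
}
```

For example, `collect_node_ids 192.168.110.111 192.168.110.112 | find_duplicate_ids` prints nothing when every node has its own ID, and prints the shared ID when the nodes were cloned.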

To fix this, delete the indices (in /var/lib/elasticsearch) on every node. This deletes all data in Elasticsearch and also resets the node IDs.
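The cleanup can be sketched as follows. This is an illustration under assumptions, not an official procedure: `clear_data_dir` is a hypothetical helper, and the commented usage assumes a package install managed by systemd with a service named `elasticsearch`. It empties the directory but keeps the directory itself, so its ownership and permissions are preserved.

```shell
# Remove everything inside a directory while keeping the directory itself.
# WARNING: on an Elasticsearch data directory this destroys all indexed data.
clear_data_dir() {
  # Refuse to run on an empty argument or on something that is not a directory.
  if [ -z "$1" ] || [ ! -d "$1" ]; then
    echo "not a directory: $1" >&2
    return 1
  fi
  # -mindepth 1 skips the directory itself; -delete removes contents depth-first.
  find "$1" -mindepth 1 -delete
}

# Assumed usage, run as root on each node with Elasticsearch stopped:
#   systemctl stop elasticsearch
#   clear_data_dir /var/lib/elasticsearch
#   systemctl start elasticsearch
```

Stopping the service first matters, since a running node would otherwise keep writing to the directory while it is being emptied.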

To avoid this problem in the first place, you can:

  • A. install Elasticsearch after cloning the VMs
  • B. use an automation tool such as Ansible or Puppet to manage Elasticsearch.
A-y answered Dec 27 '22 13:12


The Elasticsearch data directory ($ES_HOME/data or, for an RPM install, /var/lib/elasticsearch) holds a node ID that is randomly generated when Elasticsearch is first started. If this directory is copied to multiple instances that are expected to form a cluster, the following error should be received:

failed to send join request to master [..] IllegalArgumentException [..] found existing node [..] with the same id but is a different node instance

However, when minimum_master_nodes is not met, an error less indicative of the problem is received:

failed to send join request to master [..] NotMasterException [..] not master for join request

GitHub issue: https://github.com/elastic/elasticsearch/issues/32904
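To see which of the two symptoms a node is logging, the log can be scanned for both messages. The sketch below assumes each join failure is written on a single log line, as in the excerpt in the question; `classify_join_failure` is a hypothetical helper name, and the log file location depends on the installation.

```shell
# Read an Elasticsearch log on stdin and count the two join-failure variants.
classify_join_failure() {
  awk '
    /failed to send join request/ && /with the same id but is a different node instance/ { dup++ }
    /failed to send join request/ && /not master for join request/ { notmaster++ }
    END {
      printf "duplicate-id errors: %d\n", dup
      printf "not-master errors: %d\n", notmaster
    }
  '
}
```

It is used by redirecting a log file into the function, e.g. `classify_join_failure < path/to/cluster.log`. A non-zero count on either line, on nodes whose data directory was copied, points to the same underlying cause.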

The issue can be resolved by deleting the contents of the data directory, and data directories shouldn't be copied in the first place.

PhaedrusTheGreek answered Dec 27 '22 12:12