
Cluster hanging on node failure

Hello all of you bright people,

We’re currently running a smallish 300 GB cluster in production on 5 nodes with around 30 million documents. Everything works flawlessly except when a node really goes down (I mean something like a network or hardware failure).

Generally, when we lose a node the cluster becomes more or less completely unresponsive for a few minutes, for both indexing and querying. This is, of course, less than ideal as we have load 24/7.

I would really appreciate some help with understanding best-practice settings for a robust cluster.

Our first goal is for the cluster not to become unresponsive in the event of a node crash. After reading everything I could find on the web, I still can't tell whether ES is designed to be unresponsive for ping_retries*ping_timeout seconds or whether the cluster will continue to serve query requests during this time. Could anyone help shed light on this?

Secondly, in the event of an even worse failure where the cluster goes into red state, would it be possible to allow the cluster to still serve read/query requests?

I would be ever so grateful for anyone willing to help me understand how this works or what we would need to change to make our ES installation more robust.

I’ve included our config here:

cluster.name: clustername
node.name: nodename
path.data: /data
node.master: true
node.data: true
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.multicast.ping.enabled: false
discovery.zen.ping.unicast.enabled: true
discovery.zen.ping.unicast.hosts: ["host1","host2","host3"]
bootstrap.mlockall: true
http.cors.enabled: true
index.number_of_shards: 10
action.disable_delete_all_indices: true
marvel.agent.exporter.es.hosts: ["marvel:9200"]
Max Charas asked Feb 19 '15 08:02


1 Answer

The cluster hangs on node failure because of the fault detection timeout values. These are the settings I use, with the defaults noted in the comments:

discovery.zen.fd.ping_interval: 1s   # default: 1s
discovery.zen.fd.ping_timeout: 2s    # default: 30s
discovery.zen.fd.ping_retries: 3     # default: 3

There are two fault detection processes running.

In the first, the master pings all the other nodes in the cluster to verify that they are alive.

In the second, each node pings the master to verify that it is still alive or whether an election process needs to be initiated.

With the above configuration, if a node fails the master retries 3 times with a timeout of 2 seconds each (about 6 seconds of hang in total) instead of waiting about 90 seconds (3 retries × 30 seconds) with the defaults.
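To make the arithmetic above concrete, here is a minimal Python sketch of the estimate. It assumes every retry waits for the full timeout and ignores the ping interval and any election or shard-recovery time, so treat the result as a rough upper bound on detection time only. The function name `detection_window_seconds` is made up for illustration and is not part of Elasticsearch.

```python
def detection_window_seconds(ping_timeout_s, ping_retries):
    """Rough upper bound on how long a dead node can go undetected.

    Assumption: each of the ping_retries attempts waits the full
    ping_timeout_s before it is counted as failed.
    """
    return ping_timeout_s * ping_retries


# Default settings: 30s timeout, 3 retries.
default_window = detection_window_seconds(30, 3)

# Tuned settings from this answer: 2s timeout, 3 retries.
tuned_window = detection_window_seconds(2, 3)

print("default:", default_window, "seconds")
print("tuned:", tuned_window, "seconds")
```

Plugging in your own timeout and retry values gives a quick feel for the trade-off: a shorter window means faster failover, but also a higher chance of dropping a node that is only briefly slow.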

Please note that I'm running the cluster on a local network with <1 ms latency and 1 Gbps connectivity; depending on your environment, you should adjust these values accordingly. I'm on Elasticsearch 5.1.1, so refer to the documentation for your version for the exact syntax.

Farhad Farahi answered Sep 21 '22 20:09