Failing to start shard in ElasticSearch IndexShardGatewayRecoveryException "sending failed"

Question

I'm getting this error in my ES log. I'm using three nodes.

Caused by: java.lang.ArrayIndexOutOfBoundsException
[2014-09-08 13:53:56,167][WARN ][cluster.action.shard     ] [Dancing Destroyer] [events][3] sending failed shard for [events][3], node[RDZy21y7SRep7n6oWT8ogg], [P], s[INITIALIZING], indexUUID [gzj1aHTnQX6XDc0SxkvxDQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[events][3] failed recovery]; nested: FlushFailedEngineException[[events][3] Flush failed]; nested: ArrayIndexOutOfBoundsException; ]]
[2014-09-08 13:53:56,357][WARN ][indices.cluster          ] [Dancing Destroyer] [events][3] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [events][3] failed recovery
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.index.engine.FlushFailedEngineException: [events][3] Flush failed
        at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
        at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryFinalization(InternalIndexShard.java:726)
        at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:249)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
        ... 3 more
Caused by: java.lang.ArrayIndexOutOfBoundsException
[2014-09-08 13:53:56,381][WARN ][cluster.action.shard     ] [Dancing Destroyer] [events][3] sending failed shard for [events][3], node[RDZy21y7SRep7n6oWT8ogg], [P], s[INITIALIZING], indexUUID [gzj1aHTnQX6XDc0SxkvxDQ], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[events][3] failed recovery]; nested: FlushFailedEngineException[[events][3] Flush failed]; nested: ArrayIndexOutOfBoundsException; ]]

This means that the status of ES is red and I'm missing nearly 10 million documents. What does this error mean, and how might I be able to recover?

Repox · Accepted Answer

It seems that I had a corrupted shard that needed fixing. This is done at the Lucene level: you tell Lucene to fix the shard's index.

On Ubuntu, the solution was to go to the /usr/share/elasticsearch/lib directory, find out which Lucene core version it was running (ls will show a file named something like lucene-core-4.8.1.jar), and then run:

java -cp lucene-core-x.x.x.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/<clustername>/nodes/0/indices/<index>/<shard>/index/ -fix

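Replace x.x.x with the Lucene core version, <clustername> with your cluster name, <index> with the name of the index, and <shard> with the number of the failing shard.

Be aware that this can cause documents to be lost: -fix removes any segments that CheckIndex cannot read, along with the documents in them.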

But it fixed our issue.
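
Since -fix can drop documents, it may be worth running the same check without -fix first, which only reports problems and leaves the index untouched. The following is a minimal sketch of that idea, not part of the original answer: the function names find_lucene_core and checkindex_cmd are hypothetical, and it only prints the command so you can review it before running it yourself.

```shell
#!/bin/sh
# Sketch only: locate the lucene-core jar and print a CheckIndex command
# WITHOUT -fix, so the shard is inspected but not modified.
# Function names are hypothetical helpers, not part of Elasticsearch or Lucene.

# Print the first lucene-core jar found in the given lib directory.
find_lucene_core() {
  lib_dir=$1
  for jar in "$lib_dir"/lucene-core-*.jar; do
    # If the glob matched nothing, $jar is the literal pattern and -e fails.
    [ -e "$jar" ] && { printf '%s\n' "$jar"; return 0; }
  done
  return 1
}

# Print the read-only CheckIndex command for a jar ($1) and a shard index directory ($2).
# Paths containing spaces would need quoting before the printed command is run.
checkindex_cmd() {
  printf 'java -cp %s -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex %s\n' "$1" "$2"
}

# Example (placeholders as in the answer above):
# jar=$(find_lucene_core /usr/share/elasticsearch/lib) &&
#   checkindex_cmd "$jar" "/var/lib/elasticsearch/<clustername>/nodes/0/indices/<index>/<shard>/index/"
```

If the report looks acceptable, the same command with -fix appended is the one from the answer. Stopping Elasticsearch on that node beforehand is advisable, because CheckIndex should not be run with -fix against an index that is being written to.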

karthik r · Answer

I have faced this problem multiple times. As my setup reads clickstream data (12-20M hits a day), I could not afford any data loss.

So this was my solution and it runs beautifully:

Solution:

1. Stop Elasticsearch.
2. Go to /path/to/my/data/mycluster_name/nodes/0/indices/myindex_name/index
3. Delete the segments.gen file.
4. Start Elasticsearch.
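
The steps above can be sketched as a small shell helper. This is a cautious variation rather than the exact procedure: it moves segments.gen to a backup directory instead of deleting it, so the file can be put back if the shard still does not start. The function name set_aside_segments_gen and the backup location are hypothetical, and the service commands in the comments assume a packaged install.

```shell
#!/bin/sh
# Sketch only: move segments.gen out of a shard's index directory.
# Variation on the answer: the file is moved to a backup directory, not deleted.
# Assumption: Elasticsearch is already stopped on this node
# (e.g. "service elasticsearch stop" on a packaged install).

set_aside_segments_gen() {
  index_dir=$1    # directory that contains segments.gen
  backup_dir=$2   # hypothetical backup location, kept OUTSIDE the index directory
                  # so that no extra "segments..." file is left for Lucene to find
  gen_file="$index_dir/segments.gen"
  if [ ! -f "$gen_file" ]; then
    echo "no segments.gen in $index_dir" >&2
    return 1
  fi
  mkdir -p "$backup_dir" && mv "$gen_file" "$backup_dir/"
}

# Example (path from the answer; the backup directory is a placeholder):
# set_aside_segments_gen /path/to/my/data/mycluster_name/nodes/0/indices/myindex_name/index /root/segments-backup
# Afterwards start Elasticsearch again (e.g. "service elasticsearch start").
```
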

Problem Root Cause

Shards can fail for various reasons, especially when they cannot keep up with Kibana requests.
Lucene is not directly involved in this process. When something goes wrong, Elasticsearch cannot reliably pick up the shard's current segments reference, which Lucene stores in segments.gen.
Lucene writes this value afresh on the next run, so Elasticsearch can reference the segments correctly again and the shard issue is resolved.

f01 · Answer

Taking hints from Repox, this is what I ran on CentOS 6.5, using the Elasticsearch that is built into Logstash and provisioned with Chef:

https://github.com/lusis/chef-logstash (v0.10.0)
logstash 1.4.2
ES/logstash working directory, e.g. /opt/logstash/forwarder

java -cp /opt/logstash/forwarder/vendor/jar/elasticsearch-1.1.1/lib/lucene-core-4.7.2.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /opt/logstash/forwarder/data/elasticsearch/nodes/0/indices/logstash-2014.11.01/3/index/ -fix

But even after running the fix, I still see Failed to start shard, message ... failed to recover shard. I have to destructively delete the affected indices, e.g. with curator delete --older-than 3.
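
When several shards are affected, it can help to list exactly which index and shard pairs the node reports before deciding what to check or delete. The sketch below assumes log lines in the format shown in the question ("[index][shard] failed to start shard"); the function name failed_shards and the log path in the example are hypothetical.

```shell
#!/bin/sh
# Sketch only: print the unique "index shard" pairs that a log file reports
# as "failed to start shard". Assumes the log line format quoted in the question.

failed_shards() {
  log_file=$1
  # Capture the [index][shard] pair directly before "failed to start shard".
  sed -n 's/.*] \[\([^]]*\)]\[\([0-9][0-9]*\)] failed to start shard.*/\1 \2/p' "$log_file" | sort -u
}

# Example (log path is a placeholder):
# failed_shards /var/log/elasticsearch/mycluster.log
```

Each printed pair corresponds to one <index>/<shard>/index/ directory that can be passed to CheckIndex.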
