I have a managed cluster hosted by elastio.co. Here is the configuration
|Platform => Amazon Web Services
| |Memory => 4 GB
|
|Storage => 96 GB
| |SSD => Yes
| |High availability => Yes 2 data centers
|
Each index in this cluster contain log data of exactly one day. Average index size is 15 mb
and average doc count is 15000
. The cluster is not in any way under any kind of pressure (JVM, Indexing & Searching time, Disk Space all are in very comfort zone)
When I opened a previously closed index the cluster is turned RED. Here are some matrices I found querying the elasticsearch.
GET /_cluster/allocation/explain
{
"index": "some_index_name", # 1 Primary shard , 1 replica shard
"shard": 0,
"primary": true
}
Response :
"unassigned_info": {
"reason": "ALLOCATION_FAILED"
"failed_allocation_attempts": 3,
"details": "failed recovery, failure RecoveryFailedException[[some_index_name][0]: Recovery failed on {instance-*****}{Hash}{HASH}{IP}{IP}{logical_availability_zone=zone-1, availability_zone=***, region=***}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: IndexShardRecoveryException[shard allocated for local recovery (post api), should exist, but doesn't, current files: []]; nested: IndexNotFoundException[no segments* file found in store(mmapfs(/app/data/nodes/0/indices/MFIFAQO2R_ywstzqrfbY4w/0/index)): files: []]; ",
"last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because all found copies of the shard are either stale or corrupt",
"node_allocation_decisions": [
{
"node_name": "instance-***",
"node_decision": "no",
"store": {
"in_sync": false,
"allocation_id": "RANDOM_HASH",
"store_exception": {
"type": "index_not_found_exception",
"reason": "no segments* file found in SimpleFSDirectory@/app/data/nodes/0/indices/RANDOM_HASH/0/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@346e1b99: files: []"
}
}
},
{
"node_name": "instance-***",
"node_attributes": {
"logical_availability_zone": "zone-0",
},
"node_decision": "no",
"store": {
"found": false
}
}
I've tried rerouting the shards to a node. Even setting data loss flag to true.
POST _cluster/reroute
{
"commands" : [
{"allocate_stale_primary" : {
"index" : "some_index_name", "shard" : 0,
"node" : "instance-***",
"accept_data_loss" : true
}
}
]
}
Response:
"acknowledged": true,
"state": {
"version": 338190,
"state_uuid": "RANDOM_HASH",
"master_node": "RANDOM_HASH",
"blocks": {
"indices": {
"restored_**: {
"4": {
"description": "index closed",
"retryable": false,
"levels": [
"read",
"write"
]
}
},
"restored_**": {
"4": {
"description": "index closed",
"retryable": false,
"levels": [
"read",
"write"
]
}
}
}
},
"routing_table": {
"indices": {
"SOME_INDEX_NAME": {
"shards": {
"0": [
{
"state": "INITIALIZING",
"primary": true,
"relocating_node": null,
"shard": 0,
"index": "SOME_INDEX_NAME",
"recovery_source": {
"type": "EXISTING_STORE"
},
"allocation_id": {
"id": "HASH"
},
"unassigned_info": {
"reason": "ALLOCATION_FAILED",
"failed_attempts": 4,
"delayed": false,
"details": "same as explanation above ^ ",
"allocation_status": "no_valid_shard_copy"
}
},
{
"state": "UNASSIGNED",
"primary": false,
"node": null,
"relocating_node": null,
"shard": 0,
"index": "some_index_name",
"recovery_source": {
"type": "PEER"
},
"unassigned_info": {
"reason": "INDEX_REOPENED",
"delayed": false,
"allocation_status": "no_attempt"
}
}
]
}
},
Any kind of suggestion is welcomed. Thanks and regards.
A cluster status that shows red status doesn't mean that your cluster is down. Rather, this status indicates that at least one primary shard and its replicas aren't allocated to a node. If your cluster status shows yellow status, then the primary shards for all indices are allocated to nodes in your cluster.
A red cluster indicates that at least one primary shard and all of its replicas are missing. This means that data is missing, searches will return partial results, and indexing into that shard will return errors.
A red status indicates that not only has the primary shard been lost, but also that a replica has not been promoted to primary in its place.
This occurs when the master-node is brought down abruptly.
Here are the steps I took to resolve the same issue, that I had encountered ,
Step 1: Check the allocation
Step 2: Check the shard stores
Step 3: Now reroute that index as shown below
Step 4: Repeat Step2 and Step3 until you see this output.
{ "indices" : { } }
Your cluster should go green soon.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With