
ElasticSearch - Optimal number of Shards per node

I would appreciate it if someone could suggest the optimal number of shards per Elasticsearch node for best performance, or a recommended way to arrive at the number of shards one should use, given the number of cores and the memory footprint.

asked Mar 20 '14 by Rajan

People also ask

What is the default number of shards Elasticsearch?

The default is designed to allow you to split by factors of 2 up to a maximum of 1024 shards. In Elasticsearch 7.0.0 and later versions, this setting affects how documents are distributed across shards.

How do I set number of shards in Elasticsearch?

Once you set the number of shards for an index in Elasticsearch, you cannot change it. You will need to create a new index with the desired number of shards and, depending on your use case, you may then want to transfer the data to the new index.
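As a sketch of that workaround (assuming a locally running cluster on the default port and the hypothetical index names `old-index` and `new-index`):

```shell
# Create a new index with the shard count you actually want
curl -X PUT "localhost:9200/new-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}'

# Copy the existing documents across with the _reindex API
curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "old-index" },
  "dest":   { "index": "new-index" }
}'
```

Once the reindex finishes, you can point clients (or an alias) at `new-index` and drop the old one.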

How many shards including primary and replica shards in total are created by default?

Primary vs replica shards: Elasticsearch will create, by default, 5 primary shards and one replica for each index. That means each Elasticsearch index will be split into 5 chunks, and each chunk will have one copy for high availability.

What is the ideal size of an Elasticsearch index?

Though there is technically no limit to how much data you can store on a single shard, Elasticsearch recommends a soft upper limit of 50 GB per shard, which you can use as a general guideline that signals when it's time to start a new index.

What is the maximum size of a shards in Elasticsearch?

Aim for shard sizes between 10GB and 50GB. Shards larger than 50GB may make a cluster less likely to recover from failure. When a node fails, Elasticsearch rebalances the node's shards across the data tier's remaining nodes. Shards larger than 50GB can be harder to move across a network and may tax node resources.
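Turning that guideline into arithmetic (a sketch: `min_primary_shards` is a hypothetical helper, and the 50GB ceiling is the soft limit quoted above, not a hard rule):

```python
import math

def min_primary_shards(total_gb: float, max_shard_gb: float = 50.0) -> int:
    """Smallest number of primary shards that keeps every shard
    at or under the soft size limit."""
    return max(1, math.ceil(total_gb / max_shard_gb))

# Stack Overflow's ~203 GB index from the answer below:
print(min_primary_shards(203))  # -> 5
# A small index fits comfortably in a single shard:
print(min_primary_shards(30))   # -> 1
```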

How do I limit the number of shards a node can have?

You can also limit the number of shards a node can have, regardless of the index: (Dynamic) Maximum number of primary and replica shards allocated to each node. Defaults to -1 (unlimited). Elasticsearch checks this setting during shard allocation.
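That cap is the dynamic cluster setting `cluster.routing.allocation.total_shards_per_node`; applying it might look like this (a sketch, assuming a locally running cluster):

```shell
# Cap every node at 4 shards total (primary + replica), cluster-wide
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.total_shards_per_node": 4
  }
}'
```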

What happens when a node fails in Elasticsearch?

If node C fails, Elasticsearch reallocates its shard to node B. Reallocating the shard to node A would exceed node A’s shard limit. These settings impose a hard limit which can result in some shards not being allocated.

What is sharding in Elasticsearch and how does it work?

Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability.


2 Answers

I'm late to the party, but I just wanted to point out a couple of things:

  1. The optimal number of shards per index is always 1. However, that provides no possibility of horizontal scale.
  2. The optimal number of shards per node is always 1. However, then you cannot scale horizontally more than your current number of nodes.

The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.

Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.

That raises the question: why would you want more than one shard per node? The answer is planned scalability. The number of shards in an index is fixed; the only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index, that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203GB (see: https://stackexchange.com/performance). Recreating all that data would be a big deal, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, you can later scale out to up to 6 nodes easily, without resharding.
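The 3-nodes/6-shards plan above can be sketched as an index creation request (hypothetical index name, assuming a local cluster, and reading "6 shards" as 6 primaries; replicas could be added on top for availability):

```shell
# 6 primary shards on a 3-node cluster: 2 per node today,
# with room to grow to 6 nodes later without reindexing
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 0
  }
}'
```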

answered Sep 30 '22 by Chris Pratt


There are three situations to consider before sharding:

Situation 1) You want to use Elasticsearch with failover and high availability. Then you go for sharding. In this case, you need to select the number of shards according to the number of nodes (ES instances) you want to use in production.

Suppose you want to run 3 nodes in production. Then you would choose 1 primary shard and 2 replicas for every index; choosing more shards than you need only adds overhead.

Situation 2) Your current server can hold the current data, but the data grows dynamically, so in the future you may run out of disk space, or your server may not be able to handle that much data. In that case, you need to configure more shards, such as 2 or 3 per index (it's up to your requirements), but there shouldn't be any replicas.

Situation 3) This is the combination of situations 1 and 2: your data grows dynamically and you also need high availability and failover. Then you combine both configurations: an index with 2 shards and 1 replica. You can then share data among nodes and get optimal performance..!
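Situation 3's configuration (2 shards, 1 replica) corresponds to settings like the following (hypothetical index name, assuming a local cluster):

```shell
# 2 primary shards for growth, 1 replica of each for failover
curl -X PUT "localhost:9200/logs" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}'
```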

Note: A query is processed in each shard, and then a map-reduce step over the results from all shards produces the final result returned to us. That map-reduce step is expensive, so the minimum number of shards gives us optimal performance.

If you are using only one node in production, then one primary shard is the optimal number of shards for each index.

Hope it helps..!

answered Sep 30 '22 by BlackPOP