I would appreciate it if someone could suggest the optimal number of shards per Elasticsearch node for good performance, or a recommended way to arrive at the number of shards one should use, given the number of cores and the memory footprint.
The default value of index.number_of_routing_shards is designed to allow you to split an index by factors of 2, up to a maximum of 1024 shards. In Elasticsearch 7.0.0 and later versions, this setting also affects how documents are distributed across shards.
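As a rough illustration, here is what a split by a factor of 2 might look like with the official Python client (a sketch assuming elasticsearch-py 8.x; the index names are hypothetical). Note that the source index has to be made read-only before it can be split:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A split requires the source index to be read-only first.
es.indices.put_settings(
    index="logs-v1",
    settings={"index.blocks.write": True},
)

# Split into twice as many primary shards. The target count must be a
# factor of index.number_of_routing_shards; the default allows factors
# of 2 up to 1024.
es.indices.split(
    index="logs-v1",
    target="logs-v2",
    settings={"index.number_of_shards": 2},
)
```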
Once you set the number of shards for an index in Elasticsearch, you cannot change it. You will need to create a new index with the desired number of shards and, depending on your use case, you may then want to transfer the data to the new index.
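For example, that re-sharding workflow might look like this with the official Python client (a sketch assuming elasticsearch-py 8.x; the index names and shard counts are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create a new index with the desired number of shards.
es.indices.create(
    index="products-v2",
    settings={"number_of_shards": 3, "number_of_replicas": 1},
)

# Copy the documents over. For large indices, consider
# wait_for_completion=False and tracking the returned task instead.
es.reindex(
    source={"index": "products-v1"},
    dest={"index": "products-v2"},
    wait_for_completion=True,
)
```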
Primary vs. replica shards: by default, Elasticsearch creates 5 primary shards and one replica for each index (in versions before 7.0; from 7.0 onward the default is 1 primary shard and 1 replica). That means each such index is split into 5 chunks, and each chunk has one copy, for high availability.
Though there is technically no limit to how much data you can store on a single shard, Elasticsearch recommends a soft upper limit of 50 GB per shard, which you can use as a general guideline that signals when it's time to start a new index.
Aim for shard sizes between 10 GB and 50 GB. Shards larger than 50 GB may make a cluster less likely to recover from failure. When a node fails, Elasticsearch rebalances the node's shards across the data tier's remaining nodes. Shards larger than 50 GB can be harder to move across a network and may tax node resources.
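To see where you stand relative to that guideline, you can list shard sizes with the cat shards API. A minimal sketch with the Python client (the 50 GB threshold mirrors the guideline above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# List every shard with its store size rendered in GB.
shards = es.cat.shards(format="json", bytes="gb")

for shard in shards:
    size = shard.get("store")  # None for unassigned shards
    if size is not None and float(size) > 50:
        print(f"{shard['index']} shard {shard['shard']} is {size} GB; "
              "consider starting a new index")
```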
You can also limit the number of shards a node can hold, regardless of the index, with cluster.routing.allocation.total_shards_per_node: (Dynamic) Maximum number of primary and replica shards allocated to each node. Defaults to -1 (unlimited). Elasticsearch checks this setting during shard allocation.
For example, if node C fails, Elasticsearch reallocates its shard to node B, because reallocating the shard to node A would exceed node A's shard limit. These settings impose a hard limit, which can result in some shards not being allocated.
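A sketch of setting that limit dynamically with the Python client (the value 4 is an arbitrary example):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Cap the number of primary and replica shards per node, cluster-wide.
# -1 (the default) means unlimited. This is a hard limit, so shards
# that would exceed it on every node will stay unassigned.
es.cluster.put_settings(
    persistent={"cluster.routing.allocation.total_shards_per_node": 4}
)
```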
I'm late to the party, but I just wanted to point out a couple of things:
The main point is that shards have an inherent cost to both indexing and querying. Each shard is actually a separate Lucene index. When you run a query, Elasticsearch must run that query against each shard, and then compile the individual shard results together to come up with a final result to send back. The benefit to sharding is that the index can be distributed across the nodes in a cluster for higher availability. In other words, it's a trade-off.
Finally, it should be noted that any more than 1 shard per node will introduce I/O considerations. Since each shard must be indexed and queried individually, a node with 2 or more shards would require 2 or more separate I/O operations, which can't be run at the same time. If you have SSDs on your nodes then the actual cost of this can be reduced, since all the I/O happens much quicker. Still, it's something to be aware of.
That, then, begs the question: why would you want more than one shard per node? The answer to that is planned scalability. The number of shards in an index is fixed. The only way to add more shards later is to recreate the index and reindex all the data. Depending on the size of your index, that may or may not be a big deal. At the time of writing, Stack Overflow's index is 203 GB (see: https://stackexchange.com/performance). Recreating all of that data would be a big deal, so resharding would be a nightmare. If you have 3 nodes and a total of 6 shards, you can scale out to up to 6 nodes at a later point easily, without resharding.
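As a sketch of that planned-scalability setup with the Python client (the index name and counts are illustrative, not a recommendation for your data):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 6 primaries on 3 nodes = 2 shards per node today, with room to
# scale out to 6 nodes later without reindexing anything.
es.indices.create(
    index="posts",
    settings={"number_of_shards": 6, "number_of_replicas": 1},
)
```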
There are three situations to consider before sharding:
Situation 1) You want to use Elasticsearch with failover and high availability. Then you go for sharding. In this case, you need to select the number of shards according to the number of nodes (Elasticsearch instances) you want to use in production.
Suppose you want to run 3 nodes in production. Then choose 1 primary shard and 2 replicas for every index, so each node holds a complete copy of the data. Choosing more shards than you need only adds overhead.
Situation 2) Your current server can hold the current data, but the data may grow dynamically, and in the future you may run out of disk space, or your server may no longer be able to handle that much data. In that case, you need to configure more shards, such as 2 or 3 per index (it's up to your requirements), so the index can later be spread across more nodes. But there shouldn't be any replicas.
Situation 3) This is the combination of situations 1 and 2, so you need to combine both configurations. If your data grows dynamically and you also need high availability and failover, then configure an index with 2 shards and 1 replica. Then you can spread the data among nodes and get optimal performance!
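A minimal sketch of all three configurations with the official Python client (index names are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Situation 1: 3 nodes, high availability only.
# 1 primary + 2 replicas = a full copy on every node.
es.indices.create(
    index="ha-index",
    settings={"number_of_shards": 1, "number_of_replicas": 2},
)

# Situation 2: growth only, no failover.
# Several primaries, no replicas, so data can spread across nodes.
es.indices.create(
    index="growth-index",
    settings={"number_of_shards": 3, "number_of_replicas": 0},
)

# Situation 3: growth plus failover.
# 2 primaries + 1 replica of each.
es.indices.create(
    index="combined-index",
    settings={"number_of_shards": 2, "number_of_replicas": 1},
)
```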
Note: each query is processed on every shard, and the results from all shards are then merged, map-reduce style, before the final result is returned. That merge is an expensive process, so the minimum number of shards that meets your needs gives optimal performance.
If you are using only one node in production, then one primary shard is the optimal number of shards for each index.
Hope it helps!