ElasticSearch - How does sharding affect indexing performance?

1 Answer

Just to get us on the same page:

Your data is organized in indices, each made of shards that are distributed across multiple nodes. When a new document is indexed, a new id is generated and the destination shard is calculated from this id. The write is then delegated to the node that holds the calculated destination shard. This distributes your documents pretty evenly across all of your shards.
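As a rough illustration of the routing step described above, here is a minimal sketch that maps an id to a shard with "hash modulo shard count". The function name pick_shard and the made-up ids are assumptions for illustration. Elasticsearch itself hashes the routing value (the id by default) with Murmur3; SHA-256 is only a stand-in here so the sketch needs nothing beyond the standard library.

```python
import hashlib
from collections import Counter


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a shard number (illustrative only)."""
    # Stand-in hash: Elasticsearch uses Murmur3 on the routing value.
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    return hash_value % num_primary_shards


# Route 10,000 made-up ids to 5 shards and count documents per shard.
counts = Counter(pick_shard(f"doc-{i}", 5) for i in range(10_000))
for shard in sorted(counts):
    print(shard, counts[shard])
```

The same id always lands on the same shard, and the counts per shard should come out roughly equal.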

Finding a document by id is now easy, because the shard containing it can be calculated from the id alone. There is no need to search all shards. BTW, that's the reason why you can't change the number of shards afterwards: a different shard count would result in a different document distribution across your shards.
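To see why a changed shard count breaks lookups by id, the following sketch counts how many ids would be routed to a different shard when going from 5 to 6 shards. It assumes a simple "hash modulo shard count" scheme with SHA-256 as a stand-in hash (Elasticsearch uses Murmur3); pick_shard and the ids are hypothetical.

```python
import hashlib


def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a shard number (illustrative stand-in hash)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_primary_shards


ids = [f"doc-{i}" for i in range(10_000)]

# An id "moves" if its shard under 5 shards differs from its shard under 6.
moved = sum(1 for doc_id in ids if pick_shard(doc_id, 5) != pick_shard(doc_id, 6))
print(f"{moved} of {len(ids)} ids map to a different shard")
```

With a uniform hash, roughly five out of six ids end up on a different shard, so documents written under the old count could no longer be found where the new calculation points.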

Now, just to make it clear, each shard is a separate Lucene index, made of segment files located on your disk. When writing, new segments are created. Once a certain number of segment files is reached, the segments are merged. So introducing more shards without distributing them to other nodes just causes higher I/O and memory consumption on your single node. While searching, the query is executed against each shard. Afterwards, the results of all shards need to be merged into one result: more shards, more CPU work to do...
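The merge step at the end of a search can be pictured roughly like this. It is a simplified model, not how the coordinating node is actually implemented: each shard is assumed to return its hits as (score, id) pairs sorted best first, and merge_shard_hits is a hypothetical helper that combines them into one top list.

```python
import heapq
from itertools import islice


def merge_shard_hits(per_shard_hits, size):
    """Combine per-shard hit lists (each sorted by score, best first) into the top `size` hits."""
    # heapq.merge lazily merges already-sorted inputs; reverse=True means
    # every input must be ordered from highest to lowest score.
    merged = heapq.merge(*per_shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, size))


# Made-up results from three shards.
shard_results = [
    [(9.1, "a"), (4.0, "b")],
    [(7.5, "c"), (7.2, "d"), (1.3, "e")],
    [(8.8, "f")],
]

top3 = merge_shard_hits(shard_results, 3)
print(top3)  # [(9.1, 'a'), (8.8, 'f'), (7.5, 'c')]
```

Every additional shard adds one more input list to this merge, which is the extra CPU work mentioned above.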

Coming back to your question:

For your write-heavy indexing case, with just one node, the optimal number of indices and shards is 1!
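A minimal sketch of what such a single-shard setup could look like as an index creation body. number_of_shards and number_of_replicas are the actual index settings; the helper index_settings is hypothetical, and setting replicas to 0 is my assumption for the single-node case, since a replica is never allocated on the same node as its primary.

```python
import json


def index_settings(number_of_shards: int, number_of_replicas: int) -> dict:
    """Build a request body for creating an index with the given shard layout."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


# One primary shard, no replicas: a layout for a single write-heavy node.
body = index_settings(1, 0)
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of the index creation request (a PUT to the index name of your choice).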

But for the search case (not accessing by id), the optimal number of shards per node is the number of CPUs available. That way, searching can be done in multiple threads, resulting in better search performance. Correction: searching and indexing are both multithreaded, so a single shard can fully utilize all CPU cores of a node.

But what are the benefits of sharding?

Availability: By replicating the shards to other nodes, you can still serve requests if some of your nodes can't be reached anymore!
Performance: Distributing the primary shards to different nodes will distribute the workload too.

So if your scenario is write-heavy, keep the number of shards per index low. If you need better search performance, increase the number of shards, but keep the "physics" in mind. If you need reliability, take the number of nodes/replicas into account.

ibexit

Tags:

elasticsearch

Reza Same'ei
