 

When do you start additional Elasticsearch nodes? [closed]

I'm in the middle of attempting to replace a Solr setup with Elasticsearch. This is a new setup, which has not yet seen production, so I have lots of room to fiddle with things and get them working well.

I have very, very large amounts of data. I'm indexing some live data and holding onto it for 7 days (by using the _ttl field). I do not store any data in the index (and disabled the _source field). I expect my index to stabilize around 20 billion rows. I will be putting this data into 2-3 named indexes. Search performance so far with up to a few billion rows is totally acceptable, but indexing performance is an issue.
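A minimal sketch of a mapping along those lines, on a pre-5.0 Elasticsearch that still supports the _ttl field (the index and type names here are just placeholders):

    # disable _source storage and expire documents after 7 days
    curl -XPUT 'http://localhost:9200/myindex/mytype/_mapping' -d '{
      "mytype": {
        "_source": { "enabled": false },
        "_ttl":    { "enabled": true, "default": "7d" }
      }
    }'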

I am a bit confused about how ES uses shards internally. I have created two ES nodes, each with a separate data directory, each with 8 indexes and 1 replica. When I look at the cluster status, I only see one shard and one replica for each node. Doesn't each node keep multiple indexes running internally? (Checking the on-disk storage location shows that there is definitely only one Lucene index present). -- Resolved, as my index setting was not picked up properly from the config. Creating the index using the API and specifying the number of shards and replicas has now produced exactly what I would've expected to see.
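For anyone checking the same thing: one way to see how many shards each index actually got is the cluster health API with level=indices (host assumed local):

    # per-index shard counts: look at active_primary_shards / active_shards
    curl -XGET 'http://localhost:9200/_cluster/health?level=indices&pretty=true'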

Also, I tried running multiple copies of the same ES node (from the same configuration), and each new copy recognizes that one is already running and creates its own working area. These new node instances also seem to have only one index on disk. -- Now that each node is actually using multiple indices, a single node with many indices is more than sufficient to throttle the entire system, so this is a non-issue.

When do you start additional Elasticsearch nodes, for maximum indexing performance? Should I have many nodes, each running with 1 index and 1 replica, or fewer nodes with tons of indexes? Is there something I'm missing in my configuration that would let single nodes do more work?

Also: Is there any metric for knowing when an HTTP-only node is overloaded? Right now I have one node devoted to HTTP only, but aside from CPU usage, I can't tell if it's doing OK or not. When is it time to start additional HTTP nodes and split up your indexing software to point to the various nodes?

asked Sep 13 '12 by gdm



1 Answer

Let's clarify the terminology a little first:

  • Node: a running Elasticsearch instance (a Java process). Usually every node runs on its own machine.
  • Cluster: one or more nodes with the same cluster name.
  • Index: more or less like a database.
  • Type: more or less like a database table.
  • Shard: effectively a Lucene index. Every index is composed of one or more shards. A shard can be a primary shard (or simply a shard) or a replica.

When you create an index you can specify the number of shards and number of replicas per shard. The default is 5 primary shards and 1 replica per shard. The shards are automatically evenly distributed over the cluster. A replica shard will never be allocated on the same machine where the related primary shard is.
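As a concrete sketch (the index name and counts here are just examples), specifying both at creation time looks like this:

    # create an index with 8 primary shards, each with 1 replica
    curl -XPUT 'http://localhost:9200/myindex' -d '{
      "settings": {
        "number_of_shards": 8,
        "number_of_replicas": 1
      }
    }'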

What you see in the cluster status is odd; I'd suggest checking your index settings using the get settings API. It looks like you configured only one shard, but in any case you should see more shards if you have more than one index. If you need more help, you can post the output that you get from Elasticsearch.
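For example (index name assumed), the settings actually applied to an index can be verified with:

    # shows number_of_shards and number_of_replicas as the cluster sees them
    curl -XGET 'http://localhost:9200/myindex/_settings?pretty=true'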

How many shards and replicas you use really depends on your data, the way you access them, and the number of available nodes/servers. It's best practice to overallocate shards a little so that they can be redistributed if you add more nodes to your cluster, since you can't (for now) change the number of shards once you've created the index. Otherwise, you can always change the number of shards if you're willing to do a complete reindex of your data.

Every additional shard comes with a cost, since each shard is effectively a Lucene instance. The maximum number of shards that you can have per machine really depends on the hardware available and on your data as well. It's good to know that having 100 indexes with one shard each, or one index with 100 shards, is really the same thing, since you'd have 100 Lucene instances in both cases.

Of course, at query time, if you want to query a single Elasticsearch index composed of 100 shards, Elasticsearch would need to query them all in order to get proper results (unless you used specific routing for your documents, in which case you could query only a specific shard). This has a performance cost.
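A rough sketch of routing (index, type, and routing value are hypothetical): by indexing and searching with the same routing parameter, the search hits only the shard that value maps to instead of all of them.

    # index a document, routing it by user
    curl -XPUT 'http://localhost:9200/myindex/mytype/1?routing=user1' -d '{
      "user": "user1",
      "message": "hello"
    }'

    # search only the shard that "user1" routes to
    curl -XGET 'http://localhost:9200/myindex/_search?routing=user1' -d '{
      "query": { "term": { "user": "user1" } }
    }'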

You can easily check the state of your cluster and nodes using the Cluster Nodes Info API, through which you can check a lot of useful information: everything you need in order to know whether your nodes are running smoothly or not. Even easier, there are a couple of plugins for checking that information through a nice user interface (which internally uses the Elasticsearch APIs anyway): paramedic and bigdesk.
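For example (the exact URL depends on your Elasticsearch version; on 0.x these endpoints live under /_cluster/nodes instead of /_nodes):

    # static node info: hostname, JVM, network settings
    curl -XGET 'http://localhost:9200/_nodes?pretty=true'

    # runtime stats: heap usage, thread pools, indexing and search load
    curl -XGET 'http://localhost:9200/_nodes/stats?pretty=true'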

answered Sep 27 '22 by javanna