I'm learning Elasticsearch, and there's still a lot I don't get, but one thing I can't figure out (or find much about) is when to use one index and when to use more. Part of the problem is that I don't really understand what, exactly, an Elasticsearch index is.
Can you explain what an Elasticsearch index is, when you should use just one for all your data, and when you should split your data up into multiple indexes?
Bonus points / alternatively: how can I tell when I need to split my data into multiple indexes, and how should I then decide how to split the data among the new indexes?
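Elasticsearch features a powerful scale-out architecture based on a feature called sharding. As document volumes grow for a given index, its shards can be spread across more nodes without changing applications for the most part. The number of primary shards is fixed when the index is created, though, so getting more shards means creating a new index (or using the split API). Another option available to users is the use of multiple indexes.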
Reindexing means copying existing data from a source index to a destination index, which can be in the same cluster or a different one. Elasticsearch has a dedicated endpoint, _reindex, for this purpose. Reindexing is mostly required when you need to change mappings or settings.
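As a rough illustration of the _reindex endpoint mentioned above, here is a minimal Python sketch that only builds the request (method, path, and JSON body) without sending it. The helper name build_reindex_request and the index names logs-v1 and logs-v2 are made up for the example; in practice you would send the body with whatever HTTP or Elasticsearch client you already use.

```python
import json


def build_reindex_request(source_index: str, dest_index: str) -> dict:
    """Describe a _reindex call that copies documents from source to dest.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    return {
        "method": "POST",
        "path": "/_reindex",
        "body": {
            "source": {"index": source_index},
            "dest": {"index": dest_index},
        },
    }


# Example index names are assumptions, not taken from the question.
request = build_reindex_request("logs-v1", "logs-v2")
print(request["method"], request["path"])
print(json.dumps(request["body"], indent=2))
```

The destination index would normally be created first with the new mappings or settings, since that is usually the reason for reindexing.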
Why do you need one per user? What sort of data is it? Elasticsearch does not impose any strict limit to the number of indices or shards, but that does not mean that there are not practical limits. Having an index per user adds a lot of flexibility and isolation, but unfortunately does not scale well at all.
You can think of it as a schema in a SQL database.
A Schema contains the data for a given use case. An index holds the data for the use case.
The cool thing is that search can be done on multiple indices in one single request.
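To make the multi-index search point concrete: the indices are simply listed, comma-separated, in the path of the search request. A small sketch, assuming a hypothetical helper search_path and made-up index names:

```python
def search_path(*indices: str) -> str:
    """Build the URL path for a search across the given indices.

    With no index names, the path targets every index in the cluster.
    """
    if not indices:
        return "/_search"
    return "/" + ",".join(indices) + "/_search"


# Index names here are illustrative assumptions.
one_index = search_path("logs-2024.03")
two_indices = search_path("logs-2024.03", "logs-2024.04")
print(one_index)
print(two_indices)
```

Wildcards such as logs-* can also be used in place of an explicit list.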
It's hard to tell you more without any information about the use case. It depends on many factors: do you need to remove some data after a period (let's say every year)? How many documents will you index and what is the size of a document?
For example, let's say that you want to index logs and keep 3 months of logs online. You would create one index per month and one alias on top of the 3 current months.
When a month is over, create a new index for the new month, update the alias, and remove the oldest index. Removing a whole index is efficient in terms of both performance and disk space, much more so than deleting its documents one by one!
So basically in that case I would recommend using more than one index.
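Here is a minimal sketch of that monthly rotation in Python. It only computes what has to happen (which index to create, the body for the _aliases endpoint, and which index to delete); it does not talk to a cluster. The naming pattern logs-YYYY.MM, the alias name logs-current, and the helper names are all assumptions made for the example.

```python
from datetime import date


def month_index(prefix: str, year: int, month: int) -> str:
    """Name of the index holding one month of data, e.g. logs-2024.04."""
    return f"{prefix}-{year:04d}.{month:02d}"


def shift_month(year: int, month: int, delta: int) -> tuple:
    """Move a (year, month) pair forward or backward by delta months."""
    total = year * 12 + (month - 1) + delta
    return total // 12, total % 12 + 1


def rotation_plan(alias: str, prefix: str, today: date, keep: int = 3) -> dict:
    """Plan the rotation at the start of a new month, keeping `keep` months."""
    new_index = month_index(prefix, today.year, today.month)
    old_year, old_month = shift_month(today.year, today.month, -keep)
    old_index = month_index(prefix, old_year, old_month)
    return {
        "create": new_index,  # PUT /<new_index>
        "aliases_body": {     # POST /_aliases
            "actions": [
                {"add": {"index": new_index, "alias": alias}},
                {"remove": {"index": old_index, "alias": alias}},
            ]
        },
        "delete": old_index,  # DELETE /<old_index>
    }


plan = rotation_plan("logs-current", "logs", date(2024, 4, 1))
print(plan["create"], plan["delete"])
```

Searches keep going through the alias, so the application never needs to know which monthly indices currently exist.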
Imagine another situation. Let's say you are launching a game and you don't know whether it will be successful or not. So start with an index named index1 with only one shard, and create an alias named index on top of it. You launch the game and find that you need more power (more machines) because your response time is increasing dramatically. Create a new index, index2, with two shards and add it to your alias index.
This way you can scale out easily.
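The scale-out step could look roughly like this. Again, this is a sketch that only builds the two requests (create index2 with two shards, then add it to the alias); the helper names are hypothetical, and the names index2 and index follow the example above.

```python
def create_index_request(name: str, shards: int) -> dict:
    """Request that creates an index with the given number of primary shards."""
    return {
        "method": "PUT",
        "path": f"/{name}",
        "body": {"settings": {"number_of_shards": shards}},
    }


def add_to_alias_request(alias: str, index: str) -> dict:
    """Request that adds an existing index to an alias."""
    return {
        "method": "POST",
        "path": "/_aliases",
        "body": {"actions": [{"add": {"index": index, "alias": alias}}]},
    }


# Names taken from the game example: index2 joins the alias called "index".
steps = [
    create_index_request("index2", shards=2),
    add_to_alias_request("index", "index2"),
]
for step in steps:
    print(step["method"], step["path"], step["body"])
```

Searching through the alias then covers both index1 and index2. One thing to keep in mind is that writes still need a single target index, so new documents would be sent to index2 directly (or to whichever index you designate for writes).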
The key point here is IMHO aliases. Use aliases for search from the start of your project. It will help you a lot in the future.
Another use case could be that you are working for different customers. Customers don't want to have their data mixed with other customers' data. So maybe in that case you need to create one index per customer.
The fact is that Elasticsearch is very flexible and lets you design your architecture as you need.
Hope this helps.
The largest single unit of data in Elasticsearch is an index. Indexes are logical and physical partitions of documents within Elasticsearch.
Elasticsearch indexes are most similar to the database abstraction in the relational world. An Elasticsearch index is a fully partitioned universe within a single running server instance. Documents and type mappings are scoped per index, making it safe to re-use names and ids across indexes. Indexes also have their own settings for cluster replication, sharding, custom text analysis, and many other concerns.
For your reference: Shards and replicas in Elasticsearch