I'm planning a strategy for querying millions of docs in date and user directions. <ul> <li>Option 1 - indexing by user. routing by date.</li> <li>Option 2 - indexing by date. routing by user.</li> </ul> What are the differences or advantages when using routing or indexing?

Let's define the terms first. Indexing, in the context of Elasticsearch, can mean many things: <ul> <li> indexing a document: writing a new document to Elasticsearch</li> <li> indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)</li> <li> Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices</li> <li> Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcasted to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client</li> </ul> Judging by the context, "indexing by user" and "indexing by date" refers to having one index per user or one index per date interval (e.g. day). Routing refers to sending documents to shards as I described earlier. By default, this is done quite randomly: a hash range is divided by the number of shards. When a document comes in, Elasticsearch hashes its <code>_id</code>. The hash falls into the hash range of one of the shards ==> that's where the document goes. You can use custom routing to control this: instead of hashing the <code>_id</code>, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular). Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?" To answer, the strategy should account for: <ul> <li> how indexing indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks</li> <li> how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user) <ul> <li> total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state size becomes large (e.g. larger than a few 10s of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few 10s of thousands of shards in a single Elasticsearch cluster</li> </ul> </li> </ul> In practice, I've seen the following designs: <ol> <li> one index per fixed time interval. You'll see this with logs (e.g. Logstash writes to daily indices by default)</li> <li> one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies</li> <li> one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards</li> <li> one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application.</li> </ol> I didn't see a design with one index per user and routing by date interval yet. The main disadvantage here is that you'll likely write to one shard at a time (the shard containing today's hash). This will limit your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes and lots of queries for limited time intervals. BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training, where we discuss a lot about architecture, trade-offs, how Elasticsearch works under the hood. (disclosure: I deliver this class)

elasticsearch - routing VS. indexing for query performance

3 Answers

One of the design patterns that Shay Banon @ Elasticsearch recommends is: index by time range, route by user and use aliasing.

Create an index for each day (or a date range) and route documents on user field, so you could 'retire' older logs and you don't need queries to execute on all shards:

$ curl -XPOST localhost:9200/user_logs_20140418 -d '{
    "mappings" : {
        "user_log" : {
            "_routing": {
              "required": true,
              "path": "user"
            },
            "properties" : {
              "user" : { "type" : "string" },
              "log_time": { "type": "date" }
            }
        }
    }
}'

Create an alias to filter and route on users, so you could query for documents of user 'foo':

$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "add": {
      "alias": "user_foo",
      "filter": {"term": {"user": "foo"}},
      "routing": "foo"
    }
  }]
}'

Create aliases for time windows, so you could query for documents 'this_week':

$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "add": {
      "index": ["user_logs_20140418", "user_logs_20140417", "user_logs_20140416", "user_logs_20140415", "user_logs_20140414"],
      "alias": "this_week"
    }, 
    "remove": {
      "index": ["user_logs_20140413", "user_logs_20140412", "user_logs_20140411", "user_logs_20140410", "user_logs_20140409", "user_logs_20140408", "user_logs_20140407"],
      "alias": "this_week"
    }
  }]
}'

Some of the advantages of this approach:

if you search using aliases for users, you hit only shards where the users' data resides
if a user's data grows, you could consider creating a separate index for that user (all you need is to point that user's alias to the new index)
no performance implications over allocation of shards
you could 'retire' older logs by simply closing (when you close indices, they consume practically no resources) or deleting an entire index (deleting an index is simpler than deleting documents within an index)

answered Oct 05 '22 02:10

Mahesh Yellai

Indexing is the process of parsing [Tokenized, filtered] the document that you indexed[Inverted Index]. It's like appendix of an text book.

When the indexed data exceeds one server limit. instead of upgrading server configurations, add another server and share data with them. This process is called as sharding.

If we search it will search in all shards and perform map reduce and return results.If we group similar data together and search some data in specific data means it reduce processing power and increase speed.

Routing is used to store group of data in particular shards.To select a field for routing. The field should be present in all docs,field should not contains different values.

Note:Routing should be used in multiple shards environment[not in single node]. If we use routing in single node .There is no use of it.

answered Oct 05 '22 01:10

BlackPOP

Let's define the terms first.

Indexing, in the context of Elasticsearch, can mean many things:

indexing a document: writing a new document to Elasticsearch
indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)
Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices
Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcasted to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client

Judging by the context, "indexing by user" and "indexing by date" refers to having one index per user or one index per date interval (e.g. day).

Routing refers to sending documents to shards as I described earlier. By default, this is done quite randomly: a hash range is divided by the number of shards. When a document comes in, Elasticsearch hashes its _id. The hash falls into the hash range of one of the shards ==> that's where the document goes.

You can use custom routing to control this: instead of hashing the _id, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular).

Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?"

To answer, the strategy should account for:

how indexing indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks
how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user)
- total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state size becomes large (e.g. larger than a few 10s of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few 10s of thousands of shards in a single Elasticsearch cluster

In practice, I've seen the following designs:

one index per fixed time interval. You'll see this with logs (e.g. Logstash writes to daily indices by default)
one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies
one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards
one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application.

I didn't see a design with one index per user and routing by date interval yet. The main disadvantage here is that you'll likely write to one shard at a time (the shard containing today's hash). This will limit your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes and lots of queries for limited time intervals.

BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training, where we discuss a lot about architecture, trade-offs, how Elasticsearch works under the hood. (disclosure: I deliver this class)

answered Oct 05 '22 00:10

Radu Gheorghe

Related questions
                            
                                Index a dynamic object using NEST
                            
                                How to check if ElasticSearch client is connected?
                            
                                elasticsearch 2 node cluster: proper setup?
                            
                                Elasticsearch: Order of filters for best performance
                            
                                Elastisearch update by query
                            
                                elasticsearch what is the difference between best_field and most_field
                            
                                Can I customize Elastic Search to use my own Stop Word list?
                            
                                Search a nested field for multiple values on the same field with elasticsearch
                            
                                Elasticsearch filter aggregations on minimal doc count
                            
                                How to combine multiple bool queries in elasticsearch
                            
                                Elasticsearch process memory locking failed
                            
                                Symbols in query-string for elasticsearch
                            
                                Using multiple node clients in elasticsearch
                            
                                type keyword and not analyzed, any difference?
                            
                                How to perform indices query in ElasticSearch?
                            
                                How to store money in elasticsearch
                            
                                Delete documents older than 30 days in elasticsearch [closed]
                            
                                ElasticSearch vs. ElasticSearch+Cassandra
                            
                                Configure elasticsearch mapping with java api
                            
                                Removing an index from an alias in Elasticsearch

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

elasticsearch - routing VS. indexing for query performance

Tags:

elasticsearch

user3175226

People also ask

3 Answers

Mahesh Yellai

BlackPOP

Radu Gheorghe

Recent Activity

Donate For Us