elasticsearch - routing VS. indexing for query performance

I'm planning a strategy for querying millions of documents along both the date and user dimensions.

  • Option 1: index by user, route by date.
  • Option 2: index by date, route by user.

What are the differences or advantages when using routing or indexing?

asked Apr 17 '14 by user3175226



3 Answers

One of the design patterns that Shay Banon @ Elasticsearch recommends is: index by time range, route by user and use aliasing.

Create an index for each day (or a date range) and route documents on the user field, so you can 'retire' older logs and queries don't need to execute on all shards:

$ curl -XPOST localhost:9200/user_logs_20140418 -d '{
    "mappings" : {
        "user_log" : {
            "_routing": {
              "required": true,
              "path": "user"
            },
            "properties" : {
              "user" : { "type" : "string" },
              "log_time": { "type": "date" }
            }
        }
    }
}'
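
With "path": "user" in the _routing mapping (an Elasticsearch 1.x feature), the routing value is extracted from the document itself at index time. A minimal sketch of indexing a log entry into that index (the field values are made up):

# Routing is taken from the "user" field, so this document lands on the shard "foo" hashes to.
$ curl -XPOST localhost:9200/user_logs_20140418/user_log -d '{
    "user": "foo",
    "log_time": "2014-04-18T10:15:00"
}'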

Create an alias (attached to the daily index created above) that filters and routes on a user, so you can query for documents of user 'foo':

$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "add": {
      "index": "user_logs_20140418",
      "alias": "user_foo",
      "filter": {"term": {"user": "foo"}},
      "routing": "foo"
    }
  }]
}'
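
Searching through that alias then applies both the filter and the routing transparently (in practice you would add the same alias to every daily index). A hedged sketch of such a query:

# Only the shard that the routing value "foo" maps to is searched in each index,
# and the {"term": {"user": "foo"}} filter is applied automatically.
$ curl -XGET localhost:9200/user_foo/_search -d '{
    "query": { "match_all": {} }
}'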

Create aliases for time windows, so you can query this week's documents via 'this_week':

$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "add": {
      "indices": ["user_logs_20140418", "user_logs_20140417", "user_logs_20140416", "user_logs_20140415", "user_logs_20140414"],
      "alias": "this_week"
    }
  }, {
    "remove": {
      "indices": ["user_logs_20140413", "user_logs_20140412", "user_logs_20140411", "user_logs_20140410", "user_logs_20140409", "user_logs_20140408", "user_logs_20140407"],
      "alias": "this_week"
    }
  }]
}'
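
You can also combine the time-window alias with explicit routing and a filter instead of going through a per-user alias. A sketch using the Elasticsearch 1.x filtered query (names and values come from the examples above):

# Searches only the five daily indices behind 'this_week', and within each of them
# only the shard that "foo" routes to.
$ curl -XGET "localhost:9200/this_week/_search?routing=foo" -d '{
    "query": {
        "filtered": {
            "query":  { "match_all": {} },
            "filter": { "term": { "user": "foo" } }
        }
    }
}'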

Some of the advantages of this approach:

  • if you search using aliases for users, you hit only shards where the users' data resides
  • if a user's data grows, you could consider creating a separate index for that user (all you need is to point that user's alias to the new index)
  • no performance implications over allocation of shards
  • you could 'retire' older logs by simply closing an index (closed indices consume practically no resources) or deleting it entirely (deleting an index is much cheaper than deleting documents within an index); see the example commands below
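
For example, retiring a day's worth of logs (index names follow the daily pattern used above) is a single call either way:

# Close the index to keep it on disk while freeing almost all of its resources...
$ curl -XPOST localhost:9200/user_logs_20140407/_close
# ...or delete it outright.
$ curl -XDELETE localhost:9200/user_logs_20140407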

answered by Mahesh Yellai


Indexing is the process of parsing a document (tokenizing and filtering it) and storing it in an inverted index. It's like the index at the back of a textbook.

When the indexed data exceeds what a single server can handle, instead of upgrading the server's hardware you add another server and split the data between them. This process is called sharding.

By default a search runs on every shard, the per-shard results are map-reduced, and the combined result is returned. If we group related data together and search only the shards that hold it, we reduce processing work and increase speed.

Routing is used to store a group of related documents on a particular shard. When choosing a routing field, it should be present in all documents, and documents you want grouped together should share the same value.

Note: routing only pays off in a multi-shard environment; with a single shard there is nothing to gain from it.
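
To make this concrete, here is a hedged sketch of explicit routing via the URL parameter (index, type and field names are just for illustration): every document indexed with routing=foo lands on the same shard, and a search passing the same routing value only touches that shard.

# Index with an explicit routing value...
$ curl -XPOST "localhost:9200/user_logs_20140418/user_log?routing=foo" -d '{
    "user": "foo",
    "log_time": "2014-04-18T12:00:00"
}'
# ...and search only the shard that "foo" routes to.
$ curl -XGET "localhost:9200/user_logs_20140418/_search?routing=foo" -d '{
    "query": { "term": { "user": "foo" } }
}'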

answered by BlackPOP


Let's define the terms first.

Indexing, in the context of Elasticsearch, can mean many things:

  • indexing a document: writing a new document to Elasticsearch
  • indexing a field: defining a field in the mapping (schema) as indexed. All fields that you search on need to be indexed (and all fields are indexed by default)
  • Elasticsearch index: this is a unit of configuration (e.g. the schema/mapping) and of data (i.e. some files on disk). It's like a database, in the sense that a document is written to an index. When you search, you can reach out to one or more indices
  • Lucene index: an Elasticsearch index can be divided into N shards. A shard is a Lucene index. When you index a document, that document gets routed to one of the shards. When you search in the index, the search is broadcasted to a copy of each shard. Each shard replies with what it knows, then results are aggregated and sent back to the client
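
To make the last two definitions concrete, here is a minimal sketch (index name and shard counts are arbitrary) of creating an Elasticsearch index that is divided into five shards, i.e. five Lucene indices, each with one replica copy:

$ curl -XPUT localhost:9200/my_index -d '{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    }
}'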

Judging by the context, "indexing by user" and "indexing by date" refer to having one index per user or one index per date interval (e.g. per day).

Routing refers to sending documents to shards, as I described earlier. By default this is done essentially at random: the hash space is divided among the shards, and when a document comes in, Elasticsearch hashes its _id; whichever shard's range the hash falls into is where the document goes.

You can use custom routing to control this: instead of hashing the _id, Elasticsearch can hash a routing value (e.g. the user name). As a result, all documents with the same routing value (i.e. same user) land on the same shard. Routing can then be used at query time, so that Elasticsearch queries just one shard (per index) instead of N. This can bring massive query performance gains (check slide 24 in particular).
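
As a rough sketch, the shard-selection rule boils down to the following, where _routing defaults to the document's _id and is replaced by the custom routing value (e.g. the user name) when one is supplied:

shard_num = hash(_routing) % number_of_primary_shards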

Back to the question at hand, I would take it as "what are the differences or advantages when breaking data down by index or using routing?"

To answer, the strategy should account for:

  • how indexing (writing) is done. If there's heavy indexing, you need to make sure all nodes participate (i.e. write similar amounts of data on the same number of shards), otherwise there will be bottlenecks
  • how data is queried. If queries often refer to a single user's data, it's useful to have data already broken down by user (index per user or routing by user)
  • total number of shards. The more shards, nodes and fields you have, the bigger the cluster state. If the cluster state becomes large (e.g. larger than a few tens of MB), it becomes harder to keep in sync on all nodes, leading to cluster instability. As a rule of thumb, you'll want to stay within a few tens of thousands of shards in a single Elasticsearch cluster

In practice, I've seen the following designs:

  1. one index per fixed time interval. You'll see this with logs (e.g. Logstash writes to daily indices by default)
  2. one index per time interval, rotated by size. This maintains constant index sizes even if write throughput varies
  3. one index "series" (either 1. or 2.) per user. This works well if you have few users, because it eliminates filtering. But it won't work with many users because you'd have too many shards
  4. one index per time interval (either 1. or 2.) with lots of shards and routing by user. This works well if you have many users. As Mahesh pointed out, it's problematic if some users have lots of data, leading to uneven shards. In this case, you need a way to reindex big users into their own indices (see 3.), and you can use aliases to hide this logic from the application (see the alias sketch after this list).
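
As a hedged sketch of that last point (the dedicated index name is made up): once a big user's data has been reindexed into its own index, you can repoint that user's alias atomically, removing it from whichever daily indices it currently covers (only one is shown here), so the application keeps querying the same name:

$ curl -XPOST localhost:9200/_aliases -d '{
  "actions": [{
    "remove": { "index": "user_logs_20140418", "alias": "user_foo" }
  }, {
    "add":    { "index": "big_user_foo", "alias": "user_foo" }
  }]
}'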

I haven't yet seen a design with one index per user and routing by date interval. The main disadvantage there is that you'd likely write to one shard at a time (the shard containing today's hash), which limits your write throughput and your ability to balance writes. But maybe this design works well for a high-but-not-huge number of users (e.g. 1K), few writes, and lots of queries over limited time intervals.

BTW, if you want to learn more about this stuff, we have an Elasticsearch Operations training where we discuss a lot about architecture, trade-offs, and how Elasticsearch works under the hood (disclosure: I deliver this class).

answered by Radu Gheorghe