 

Elasticsearch: Order of filters for best performance

The Elasticsearch guide says:

"Each filter is calculated and cached independently, regardless of where it is used. If two different queries use the same filter, the same filter bitset will be reused. Likewise, if a single query uses the same filter in multiple places, only one bitset is calculated and then reused." (https://www.elastic.co/guide/en/elasticsearch/guide/current/filter-caching.html)

On another page it also says:

"The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible. If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A." (https://www.elastic.co/guide/en/elasticsearch/guide/current/_filter_order.html)

I do not quite understand how the order of filters in a bool clause is important when each filter is cached independently.

I would imagine that Clause B is executed or retrieved from the cache, Clause A is executed or retrieved from the cache, and then the two filter bitsets are 'merged'. Why would the order matter?
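
For concreteness, here is a sketch of the kind of bool filter I have in mind (the field names are made up for illustration), with the selective clause ("Clause B") listed before the broad one ("Clause A"):

"bool" : {
    "must" : [
        { "term" : { "status" : "published" } },
        { "range" : { "created_at" : { "gte" : "2015-01-01" } } }
    ]
}

In this mental model, each clause produces its own independently cached bitset and the bitsets are then intersected, so swapping the two entries in the must array should not change anything.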

Asked Jan 11 '16 by Ronald




3 Answers

This blog post on the Elastic website, posted in May 2017, says:

Q: Does the order in which I put my queries/filters in the query DSL matter?

A: No, because they will be automatically reordered anyway based on their respective costs and match costs.

Answered by Devi


This guidance is a little misleading. The reality is more complicated, and it is very hard to write one set of rules that fits all situations. As the data changes, the rules change. As the query and filter types change, the rules change. If a specific filter happens to be slower to execute than a broad one, the rules change. On a per-segment basis the result size of a filter might even be the opposite of what it is on another segment; it isn't always predictable. So first you have to understand more of the internals, and then you need to let go of trying to control it as you move into modern Elasticsearch 2.x.

NOTE: your second quote (filter order) and its associated link point to a page that is considered "out of date" for Elasticsearch 2.x and will be updated later. Therefore that advice may or may not apply to modern versions.

Looking back in time to Elasticsearch 1.x and the reason for the ordering suggestion:

Let's talk first about how filters are represented in memory. A filter is either an iterated list of matching documents or a random-access "is this document a match?" model; which one is more efficient depends on the type of filter. Now if everything is cached, you are just intersecting the bitsets, and the cost will vary by their size and type.

If filters are not cached but are cacheable, then each filter will execute independently, and the previous filters will only affect it through the total cost of the intersection.

If the filter is NOT cacheable then it COULD be guided by the previous results. Imagine a query plus a filter. If you execute the query and afterwards apply the filter, you are doing a lot of extra work whenever the filter limits the results to a very small set of records: you wasted time in the query collecting, scoring, and building a big result set. But if you convert it to a FilteredQuery and do both at the same time, the query ignores all records already eliminated by the filter; it only has to consider the documents still in play. This is called "skipping". Not all filter types take advantage of skipping, but some can. And this is why a smaller "guiding" filter makes the others that use it faster.
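
As a rough 1.x-style sketch of that difference (the field names and values are invented for illustration), the query and the filter can be wrapped together in a filtered query so the filter guides the query instead of being applied afterwards:

"query" : {
    "filtered" : {
        "query" : {
            "match" : { "title" : "elasticsearch performance" }
        },
        "filter" : {
            "term" : { "type" : "book" }
        }
    }
}

Here the term filter can act as the small guiding set, so the match query never has to collect or score documents the filter has already eliminated.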

Unless you know each filter type, the heuristics of your data, and how each specific filter is affected by all of this, you do not have enough information to do anything other than say "put the most limiting filters first and the larger ones second" and hope it works out. For bool the default is not to cache its overall result, so you have to pay attention to its repeated performance (and/or cache it). It is more efficient when one side of the filter intersection is small, so starting with a small one makes all the other intersections faster because the result can only get smaller. If it were a bool query doing scoring instead of a filter, it would be even more important to avoid scoring more documents than necessary.

One other important note is that the "most specific filter first" rule can sometimes be slow to apply (a script filter, for example), so it should really read: "lowest-cost, most specific filters first".
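
A hedged 1.x-era sketch of that "lowest-cost, most specific first" idea (the field names and the script are invented for illustration): a cheap, selective term filter is listed before an expensive script filter, so that in the 1.x model described above the script only needs to be evaluated against documents that survive the first clause:

"bool" : {
    "must" : [
        { "term" : { "status" : "active" } },
        { "script" : { "script" : "doc['price'].value > 100" } }
    ]
}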

With Elasticsearch 2.0, things will change:

It’s time to forget everything you knew about queries and filters: Elasticsearch 2.0 will make much better decisions by itself instead of relying on users to formulate an optimized query.

In 2.x you should try less to game the system and instead let the engine make the best choices. The engine may actually end up with something quite different under the hood: a rewritten filter, a complete change in internal structure and data. And you may not even control the caching anymore, so you need to read more about that.

The previous filter API could be consumed in two ways: either using iterators over matching documents, or using an optional random-access API that allowed to check if a particular document matched the filter or not. Everything is good so far, except that the best way to consume a filter depended on which kind of filter you had: for instance the script filter was more efficient when using the random-access API while the bool filter was more efficient using the iterator API. This was quite a nightmare to optimize and was the root cause why the bool filter on the one hand and the and and or filters on the other hand performed differently.

The engine will now decide what is best taking more factors into consideration including scoring, estimation of result size, best way to intersect related filters, maybe even on a per segment basis, and more.

This article also makes it clear that even caching can be misleading; it doesn't always make things faster. Sometimes the internal data structure a filter uses when originally executed is better than the bitset structure that is always used for caching. So in 2.x this is also changing, to avoid caching things that execute better from their native data structure without any caching at all.

The blog post on Roaring Bitmaps has more details:

Clearly the most important requirement is to have something fast: if your cached filter is slower than executing the filter again, it is not only consuming memory but also making your queries slower. The more sophisticated an encoding is, the more likely it is to slow down encoding and decoding because of the increased CPU usage

There you get a lot of information about the internal data structures, caching, intersection, and the rest of the internal changes in 2.x, which will give your understanding of filter performance more depth.

While it may surprise you if you are new to search engine internals, one of the most important building blocks of a search engine is the ability to efficiently compress and quickly decode sorted lists of integers.

These last few 2.x blog links give you a lot of background on your question; they talk about all of the issues you are trying to work around with filter ordering. The information and details are all there, and from them you can get a better understanding of 1.x vs. 2.x and how queries plus filters are resolved. So remember:

There is no particular implementation which is constantly better than all others.

See also these 1.x resources for historical reference:

  • Optimizing Elasticsearch searches covers a bit more about filter ordering (a sketch of its example appears after this list). It says in summary:

    That said, you still need to think about which order you filter in. You want the more selective filters to run first. Say you filter on type: book and tag: elasticsearch. If you have 30 million documents, 10 million of type book and only 10 tagged Elasticsearch, you’ll want to apply the tag filter first. It reduces the number of documents much more than the book filter does.

  • All About Elasticsearch Filter Bitsets is considered an obsolete article for modern times, but it gives more background about the filter ordering document you quoted.

  • A forum answer by Martijn v Groningen seems to say the opposite about which of the bool vs. the and/or filters uses iteration and which uses random access, but the idea is the same for each: play it safe by limiting documents earlier in the filter list, regardless of which model applies to which type.
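
A sketch of the type: book / tag: elasticsearch example from the first resource above, written as a 1.x bool filter with the far more selective tag filter placed first (the field names are assumptions taken from that summary):

"bool" : {
    "must" : [
        { "term" : { "tag" : "elasticsearch" } },
        { "term" : { "type" : "book" } }
    ]
}

With only about 10 documents tagged elasticsearch versus 10 million books, the first clause shrinks the candidate set to a handful of documents before the much broader type filter is intersected with it.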

Answered by Jayson Minard


Not all filters are cached/cacheable. For instance, a date range filter using the now variable is not cached because it changes all the time. If you look a bit further down in the first link you gave, you'll see a section named "Controlling caching", which states this fact:

Certain leaf filters, however, are not cached by default, because it doesn’t make sense to do so: script filters, geo filters, date range filters.

To illustrate this, let's say we have the following date range filter (let's call it filter A), which matches all documents from the past month:

"range" : {
    "timestamp" : {
         "gt" : "now-1m"
    }
}

and another term filter (let's call it filter B) to filter documents with the type XYZ:

"term" : {
    "type" : "XYZ"
}

It makes a big difference (performance-wise) whether you place

  1. filter A before filter B or
  2. filter B before filter A

In case 1, the execution will be slower because all documents from the past month will need to go through filter A first, which is not cached.

In case 2, you first filter out all the documents that are not of type XYZ, which is fast because filter B is cached. Then the documents that made it through filter B can go through filter A. So even though filter A is not cached, the execution will still be faster since there are far fewer documents left in the filter pipeline.

That was a very simple example, but it should show why filter order matters, i.e. mainly because certain filters are not cached. You may change that default behavior by forcing the caching, but sometimes it's not a good idea. The best practice is to apply the most aggressive filters first so as to let as few documents as possible go through the next filter.

I personally call it the "bulldozer approach", i.e. first make sure to clear away as much material as possible as early as possible in the filter pipeline, and you eventually end up with a more chewable chunk of data that can be processed much faster.
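
To make case 2 concrete, here is a hedged sketch of both filters combined in a single 1.x-style bool filter, with the cached term filter (filter B) listed before the uncached date range filter (filter A):

"bool" : {
    "must" : [
        { "term" : { "type" : "XYZ" } },
        { "range" : { "timestamp" : { "gt" : "now-1M" } } }
    ]
}

If I remember the 1.x syntax correctly, caching can also be forced per filter with a "_cache" : true option, but as mentioned above that is not always a good idea.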

Answered by Val