
What differs between post-filter and global aggregation for faceted search?

A common problem in search interfaces is that you want to return a selection of results, but might want to return information about all documents. (e.g. I want to see all red shirts, but want to know what other colors are available).

This is sometimes referred to as "faceted results" or "faceted navigation". The example from the Elasticsearch reference is quite clear in explaining why and how, so I've used it as the base for this question.

Summary / Question: It looks like I can use either a post-filter or a global aggregation for this. They seem to provide the exact same functionality in different ways. Are there advantages or disadvantages to either that I don't see? If so, which should I use?

I have included a complete example below, with some documents and a query for each method, based on the example in the reference guide.


Option 1: post-filter

See the example from the Elasticsearch reference.

What we can do is have our original query match more results, so we can aggregate over those results, and only afterwards filter down to the results we actually want to show.

The example is quite clear in explaining it:

But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.

Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results.
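For contrast, here is a minimal sketch of the naive request the quoted passage warns about. It assumes the shirts index and the six sample documents defined in the Example section further down; the aggregation name colors is just an illustrative choice.

```json
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    }
  }
}
```

Because the terms aggregation runs within the scope of the query, it only sees the red Gucci shirts (items 1 and 3 in the sample data), so red is the only bucket it can return. That is why the color filter has to move out of the query and into a post-filter.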

The example code below shows how the post-filter version looks.

One issue with the post-filter approach is that it cannot benefit from caching. The Elasticsearch guide (not yet available for 5.1) warns about this:

Performance consideration: Use a post_filter only if you need to differentially filter search results and aggregations. Sometimes people will use post_filter for regular searches.

Don’t do this! The nature of the post_filter means it runs after the query, so any performance benefit of filtering (such as caches) is lost completely.

The post_filter should be used only in combination with aggregations, and only when you need differential filtering.

There is, however, a different option:

Option 2: global aggregations

There is a way to do an aggregation that is not influenced by the search query. So instead of matching a lot of documents, aggregating on them, and then filtering, we just fetch our filtered results but run the aggregations over everything. Take a look at the reference.

We can get the exact same results. I did not read any warnings about caching for this, but it seems that in the end we need to do about the same amount of work, so the missing warning may just be an omission.

It is a tiny bit more complicated because of the sub-aggregation we need (you can't have global and a filter on the same 'level').

The only complaint I have read about queries using this is that you might have to repeat yourself if you need to do it for several items. In the end we can generate most queries, so repeating oneself isn't much of an issue for my use case, and I do not really consider it an issue on par with "cannot use the cache".

Question

It seems the two features overlap at the least, or possibly provide the exact same functionality. This baffles me. Apart from that, I'd like to know whether one or the other has an advantage I haven't seen, and whether there is any best practice here.

Example

This is largely from the post-filter reference page, but I added the global aggregation query.

mapping and documents

PUT /shirts
{
    "mappings": {
        "item": {
            "properties": {
                "brand": { "type": "keyword"},
                "color": { "type": "keyword"},
                "model": { "type": "keyword"}
            }
        }
    }
}

PUT /shirts/item/1?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "slim"
}

PUT /shirts/item/2?refresh
{
    "brand": "gucci",
    "color": "blue",
    "model": "slim"
}


PUT /shirts/item/3?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "normal"
}


PUT /shirts/item/4?refresh
{
    "brand": "gucci",
    "color": "blue",
    "model": "wide"
}


PUT /shirts/item/5?refresh
{
    "brand": "nike",
    "color": "blue",
    "model": "wide"
}

PUT /shirts/item/6?refresh
{
    "brand": "nike",
    "color": "red",
    "model": "wide"
}

We are now requesting all red gucci shirts (items 1 and 3), the models available for these 2 shirts (slim and normal), and the colors gucci shirts come in (red and blue).

First, a post-filter: match all shirts, aggregate the models for red gucci shirts and the colors for all gucci shirts, and post-filter for red gucci shirts so that only those are shown as results. (This is a bit different from the reference example, as we try to get as close as possible to a clear application of post-filters.)

GET /shirts/_search
{
  "aggs": {
    "colors_query": {
      "filter": {
        "term": {
          "brand": "gucci"
        }
      },
      "aggs": {
        "colors": {
          "terms": {
            "field": "color"
          }
        }
      }
    },
    "color_red": {
      "filter": {
        "bool": {
          "filter": [
            {
              "term": {
                "color": "red"
              }
            },
            {
              "term": {
                "brand": "gucci"
              }
            }
          ]
        }
      },
      "aggs": {
        "models": {
          "terms": {
            "field": "model"
          }
        }
      }
    }
  },
  "post_filter": {
    "bool": {
      "filter": [
        {
          "term": {
            "color": "red"
          }
        },
        {
          "term": {
            "brand": "gucci"
          }
        }
      ]
    }
  }
}

We could also query for all red gucci shirts (our original query), and then do a global aggregation for the model (over all red gucci shirts) and for the color (over all gucci shirts).

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggregations": {
    "color_red": {
      "global": {},
      "aggs": {
        "sub_color_red": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "color": "red"   }},
                { "term": { "brand": "gucci" }}
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "model"
              }
            }
          }
        }
      }
    },
    "colors": {
      "global": {},
      "aggs": {
        "sub_colors": {
          "filter": {
            "bool": {
              "filter": [
                { "term": { "brand": "gucci" }}
              ]
            }
          },
          "aggs": {
            "keywords": {
              "terms": {
                "field": "color"
              }
            }
          }
        }
      }
    }
  }
}

Both return the same information; the second one only differs because of the extra level introduced by the sub-aggregations. The second query looks a bit more complex, but I don't think that is very problematic. A real-world query is generated by code and is probably far more complex anyway. It should be a good query, and if that means a complicated one, so be it.
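To make the "extra level" concrete, this is roughly where the color buckets should end up in each response. It is an abbreviated sketch worked out by hand from the six sample documents above, not captured output; fields such as took, hits and the per-bucket error bounds are left out, and the bucket order may differ.

For the post-filter request, the buckets sit one level below the filter aggregation:

```json
"aggregations": {
  "colors_query": {
    "doc_count": 4,
    "colors": {
      "buckets": [
        { "key": "blue", "doc_count": 2 },
        { "key": "red",  "doc_count": 2 }
      ]
    }
  }
}
```

For the global aggregation request, the same buckets sit one level deeper, below both the global and the filter aggregation:

```json
"aggregations": {
  "colors": {
    "doc_count": 6,
    "sub_colors": {
      "doc_count": 4,
      "keywords": {
        "buckets": [
          { "key": "blue", "doc_count": 2 },
          { "key": "red",  "doc_count": 2 }
        ]
      }
    }
  }
}
```

The outer doc_count of 6 reflects that the global aggregation sees every document in the index, regardless of the query.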

Nanne asked Oct 17 '22 20:10


1 Answer

The actual solution we used, while not a direct answer to the question, is basically "neither".

We got the initial hint from this Elastic blog post:

Occasionally, I see an over-complicated search where the goal is to do as much as possible in as few search requests as possible. These tend to have filters as late as possible, completely in contrary to the advise in Filter First. Do not be afraid to use multiple search requests to satisfy your information need. The multi-search API lets you send a batch of search requests.

Do not shoehorn everything into a single search request.

And that is basically what we are doing in the queries above: a big bunch of aggregations and some filtering.

Running them in parallel as separate requests proved to be much quicker. Have a look at the multi-search API.
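A minimal sketch of how the shirts example might be split up with the multi-search API. The split into these two particular requests is an illustrative assumption, not necessarily the queries used in the real application. Each search is a pair of lines: a header (empty here, so the index from the URL is used) followed by the request body on a single line.

```json
GET /shirts/_msearch
{}
{"query": {"bool": {"filter": [{"term": {"color": "red"}}, {"term": {"brand": "gucci"}}]}}, "aggs": {"models": {"terms": {"field": "model"}}}}
{}
{"size": 0, "query": {"bool": {"filter": [{"term": {"brand": "gucci"}}]}}, "aggs": {"colors": {"terms": {"field": "color"}}}}
```

The first search returns the red gucci shirts together with their models; the second returns no hits (size 0) and only the colors available for gucci. Both put their filters in the query itself, in line with the "filter first" advice, and the results come back as two entries in the responses array, in the same order as the requests.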

Nanne answered Oct 20 '22 10:10