
ElasticSearch - get all available filters (aggregate) from index

Let's say I have:

"hits": [
      {
        "_index": "products",
        "_type": "product",
        "_id": "599c2b3fc991ee0a597034fa",
        "_score": 1,
        "_source": {,
          "attributes": {
            "1": [
              "a"
            ],
            "2": [
              "b",
              "c"
            ],
            "3": [
              "d",
              "e"
            ],
            "4": [
              "f",
              "g"
            ],
            "5": [
              "h",
              "i"
            ]
          }
        }
      },
      {
        "_index": "products",
        "_type": "product",
        "_id": "599c4bb4b970c25976ced8bd",
        "_score": 1,
        "_source": {
          "attributes": {
            "1": [
              "z"
            ],
            "2": [
              "y"
            ]
          }
        }
      }
    ]

Each product has attributes. Each attribute has an ID and a value. I can filter the products by attributes just fine, but for now I am building the "possible attributes" list from MongoDB. I would like to find a way to generate such a list from Elasticsearch alone (and maybe only query MongoDB for additional data).

What I need is:

{
  1: [a, z],
  2: [b, c, y],
  etc.
}

What would such an aggregation look like? I want to get all available attributes (grouped by attribute ID) with all of their possible values (across all products).

Tomasz Gałkowski · asked Sep 19 '17 11:09



1 Answer

You cannot do it in one query, but it is fairly easy in two:

Retrieving the list of attributes

You can use the mapping to get all the fields present in your documents:

curl -XGET "http://localhost:9200/your_index/your_type/_mapping"
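
For the documents above, the response might look roughly like this (abridged; the exact types depend on your mapping. Here the default dynamic mapping is assumed, which indexes strings as text with a keyword sub-field):

{
  "products": {
    "mappings": {
      "product": {
        "properties": {
          "attributes": {
            "properties": {
              "1": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
              "2": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
              ...
            }
          }
        }
      }
    }
  }
}

The attribute IDs you need are the keys under "attributes.properties".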

Retrieving their values

You can then use multiple terms aggregations to get all the values of each field:

curl -XGET "http://localhost:9200/your_index/your_type/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "field1Values": {
      "terms": {
        "field": "field1",
        "size": 20
      }
    },
    "field2Values": {
      "terms": {
        "field": "field2",
        "size": 20
      }
    },
    "field3Values": {
      "terms": {
        "field": "field3",
        "size": 20
      }
    },
    ...
  }
}'

This retrieves the top 20 most frequent values for each field.

The limit of 20 values is there to prevent a huge response (if you have a few billion documents with a unique field, for instance). You can raise it via the "size" parameter of each terms aggregation. From your requirements, I would guess that choosing something 10x larger than a rough estimate of the number of distinct values of each field should do the trick.
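
For the documents above, a concrete version of this query could look like the following (only the first two attributes shown; the .keyword sub-fields assume the default dynamic mapping mentioned earlier):

curl -XGET "http://localhost:9200/products/product/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "attr1Values": {
      "terms": {
        "field": "attributes.1.keyword",
        "size": 20
      }
    },
    "attr2Values": {
      "terms": {
        "field": "attributes.2.keyword",
        "size": 20
      }
    }
  }
}'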

How to handle huge cardinality on values

You can also run an intermediate query with the cardinality aggregation to get the actual number of distinct values, and then use that as the size of your terms aggregation. Note that cardinality is an estimate when the count gets large, so you may want to use cardinality * 2 to be safe.

curl -XGET "http://localhost:9200/your_index/your_type/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "field1Cardinality": {
      "cardinality": {
        "field": "field1"
      }
    },
    "field2Cardinality": {
      "cardinality": {
        "field": "field2"
      }
    },
    "field3Cardinality": {
      "cardinality": {
        "field": "field3"
      }
    },
    ...
  }
}'
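
A minimal sketch of chaining the two requests from the shell, assuming jq is installed (index, type and field names are the same placeholders as above):

# Fetch the cardinality estimate for field1 ...
CARD=$(curl -s -XGET "http://localhost:9200/your_index/your_type/_search" \
  -H 'Content-Type: application/json' \
  -d'{"size":0,"aggs":{"field1Cardinality":{"cardinality":{"field":"field1"}}}}' \
  | jq '.aggregations.field1Cardinality.value')

# ... then use twice that value as the terms size, to absorb the estimation error.
curl -XGET "http://localhost:9200/your_index/your_type/_search" \
  -H 'Content-Type: application/json' \
  -d"{\"size\":0,\"aggs\":{\"field1Values\":{\"terms\":{\"field\":\"field1\",\"size\":$((CARD * 2))}}}}"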

How to handle a huge number of attributes

The previous approach works if there are not too many different attributes. If there are, you should change how the documents are stored to prevent a mapping explosion: each distinct attribute ID becomes a separate field in the mapping, and Elasticsearch caps the number of fields per index (index.mapping.total_fields.limit, 1000 by default).

Storing them like this:

{
    "attributes":[
        {
            "name":"1",
            "value":[
                "a"
            ]
        },
        {
            "name":"2",
            "value":[
                "b",
                "c"
            ]
        },
        {
            "name":"3",
            "value":[
                "d",
                "e"
            ]
        },
        {
            "name":"4",
            "value":[
                "f",
                "g"
            ]
        },
        {
            "name":"5",
            "value":[
                "h",
                "i"
            ]
        }
    ]
}

Storing the attributes this way fixes the problem: you can then use a terms aggregation on "name" with a sub terms aggregation on "value", both wrapped in a nested aggregation, to get what you want:

curl -XGET "http://localhost:9200/your_index/your_type/_search" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "attributesNested": {
      "nested": {
        "path": "attributes"
      },
      "aggs": {
        "attributes": {
          "terms": {
            "field": "attributes.name",
            "size": 1000
          },
          "aggs": {
            "values": {
              "terms": {
                "field": "attributes.value",
                "size": 100
              }
            }
          }
        }
      }
    }
  }
}'
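
The response then contains one bucket per attribute name, each with a sub-bucket per value, which maps directly onto the {attribute: [values]} structure you asked for. An abridged sketch of what it could look like for your two sample documents (bucket order and counts may differ):

"aggregations": {
  "attributesNested": {
    "doc_count": 7,
    "attributes": {
      "buckets": [
        {
          "key": "2",
          "doc_count": 2,
          "values": {
            "buckets": [
              { "key": "b", "doc_count": 1 },
              { "key": "c", "doc_count": 1 },
              { "key": "y", "doc_count": 1 }
            ]
          }
        },
        ...
      ]
    }
  }
}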

It requires using a nested mapping for attributes; see the sketch below.
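
A possible mapping, set at index-creation time (index and type names are the same placeholders as above; keyword fields are assumed so the values can be aggregated exactly):

curl -XPUT "http://localhost:9200/your_index" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "your_type": {
      "properties": {
        "attributes": {
          "type": "nested",
          "properties": {
            "name":  { "type": "keyword" },
            "value": { "type": "keyword" }
          }
        }
      }
    }
  }
}'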

Pandawan · answered Nov 15 '22 07:11