
Elasticsearch aggregations on nested inner hits

I have a large amount of data in Elasticsearch. My documents have a nested field called "records" that contains a list of objects with several fields.

I want to query specific objects from the records list, so I use inner_hits in my query. That doesn't help, though, because the aggregation request uses size 0, so no hits are returned.

I have not managed to make an aggregation run only over the inner hits: the aggregation returns results for all the objects inside records, regardless of the query.

This is the query I am using. (Each document has first_timestamp and last_timestamp fields, and each object in the records list has a timestamp field.)

curl -XPOST 'localhost:9200/_msearch?pretty' -H 'Content-Type: application/json' -d'    
{
    "index":[
        "my_index"
    ],
    "search_type":"count",
    "ignore_unavailable":true
}
{
    "size":0,
    "query":{
        "filtered":{
             "query":{
                 "nested":{
                     "path":"records",
                     "query":{
                         "term":{
                             "records.data.field1":"value1"
                         }
                     },
                     "inner_hits":{}
                 }
             },
             "filter":{
                 "bool":{
                     "must":[
                     {
                         "range":{
                             "first_timestamp":{
                                 "gte":1504548296273,
                                 "lte":1504549196273,
                                 "format":"epoch_millis"
                             }
                         }
                     }
                     ]
                 }
             }
         }
     },
     "aggs":{
         "nested_2":{
             "nested":{
                 "path":"records"
             },
             "aggs":{
                 "2":{
                     "date_histogram":{
                          "field":"records.timestamp",
                          "interval":"1s",
                          "min_doc_count":1,
                          "extended_bounds":{
                              "min":1504548296273,
                              "max":1504549196273
                          }
                     }
                }
           }
      }
   }
}'

hanetz asked Sep 04 '17 22:09



2 Answers

Elasticsearch does not support aggregations on inner_hits. The reason is that inner_hits is already a very expensive operation, and aggregating over its results would increase that cost sharply. Here is the GitHub link to the issue.

If you want to aggregate over inner_hits, you can try one of the following approaches:

  1. Write a flexible query that returns only the required hits and aggregate over them. Repeat it as many times as needed to cover all the hits, aggregating as you go. This approach can result in many search queries, which is not advisable.
  2. Handle the aggregation logic in your application layer: write an aggregation parser and run it on the response from Elasticsearch. This approach is a little better, but you take on the overhead of maintaining the parser as your needs change.
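
As a rough illustration of the second approach, here is a minimal Python sketch that buckets inner hits per second on the application side. It assumes the search is run with a non-zero size, that the inner hits are returned under the name "records", and that each inner hit's _source holds the nested object directly with a timestamp in epoch milliseconds (the exact _source shape can differ between Elasticsearch versions). The function name is hypothetical. Keep in mind that inner_hits returns only three nested hits per document by default, so its own size would have to be raised for the counts to be complete.

```python
from collections import Counter


def histogram_from_inner_hits(response, inner_name="records", ts_field="timestamp"):
    """Count inner hits per one-second bucket.

    Assumes `response` is a parsed search response (a dict) and that each
    inner hit carries the nested object in `_source` (an assumption).
    Bucket keys are epoch milliseconds floored to the second.
    """
    counts = Counter()
    for hit in response.get("hits", {}).get("hits", []):
        inner = hit.get("inner_hits", {}).get(inner_name, {})
        for record in inner.get("hits", {}).get("hits", []):
            ts = record.get("_source", {}).get(ts_field)
            if ts is None:
                continue  # skip inner hits without a timestamp
            counts[ts - ts % 1000] += 1
    return sorted(counts.items())
```

The result is a list of (bucket_start_millis, count) pairs, which mirrors what a 1s date_histogram with min_doc_count 1 would report, but only over the inner hits that were actually returned.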

Personally, I would recommend changing your data mapping in Elasticsearch so that you can run aggregations on it directly.
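
The answer does not say which mapping change it has in mind. One common option, shown here purely as an assumption, is to denormalize: index every entry of records as its own top-level document that carries a copy of the parent's fields, so that an ordinary query plus an ordinary date_histogram is enough. The helper name, the parent_id field, and the ID scheme below are all hypothetical.

```python
def flatten_records(doc_id, source):
    """Yield (new_id, flat_doc) pairs, one per entry in source["records"].

    Each flat document copies the parent's top-level fields (for example
    first_timestamp and last_timestamp) next to the record's own fields.
    If a key exists in both, the record's value wins.
    """
    parent_fields = {k: v for k, v in source.items() if k != "records"}
    for i, record in enumerate(source.get("records", [])):
        flat = {"parent_id": doc_id}  # hypothetical back-reference field
        flat.update(parent_fields)
        flat.update(record)
        yield f"{doc_id}-{i}", flat
```

The trade-off is more documents and duplicated parent fields in exchange for simpler queries; whether that is acceptable depends on your data volume.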


Saket Gupta answered Oct 18 '22 04:10


Your query is pretty complex. In short, here is the query you asked for:

{
  "size": 0,
  "aggregations": {
    "nested_A": {
      "nested": {
        "path": "records"
      },
      "aggregations": {
        "bool_aggregation_A": {
          "filter": {
            "bool": {
              "must": [
                {
                  "term": {
                    "records.data.field1": "value1"
                  }    
                }
              ]
            }
          },
          "aggregations": {
            "reverse_aggregation": {
              "reverse_nested": {},
              "aggregations": {
                "bool_aggregation_B": {
                  "filter": {
                    "bool": {
                      "must": [
                        {
                          "range": {
                            "first_timestamp": {
                              "gte": 1504548296273,
                              "lte": 1504549196273,
                              "format": "epoch_millis"
                            }
                          }
                        }
                      ]
                    }
                  },
                  "aggregations": {
                    "nested_B": {
                      "nested": {
                        "path": "records"
                      },
                      "aggregations": {
                        "my_histogram": {
                          "date_histogram": {
                            "field": "records.timestamp",
                            "interval": "1s",
                            "min_doc_count": 1,
                            "extended_bounds": {
                              "min": 1504548296273,
                              "max": 1504549196273
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Now, let me explain each step by aggregation name:

  • size: 0 -> we are not interested in hits, only in aggregations
  • nested_A -> data.field1 lives under records, so we narrow the scope to records
  • bool_aggregation_A -> filter by data.field1: value1
  • reverse_aggregation -> first_timestamp is not in the nested document, so we step back out of records to the parent
  • bool_aggregation_B -> filter by the first_timestamp range
  • nested_B -> scope into records again for the timestamp field (located under records)
  • my_histogram -> finally, build the date histogram on the timestamp field
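
The steps above can also be sketched in Python, which makes the chain easier to read and shows where the histogram buckets end up in the response. This is only an illustration: the helper names (wrap, build_body, buckets) are hypothetical, and the response layout assumes each sub-aggregation's result is keyed by its name inside its parent's result.

```python
TERM = {"term": {"records.data.field1": "value1"}}
RANGE = {"range": {"first_timestamp": {
    "gte": 1504548296273, "lte": 1504549196273, "format": "epoch_millis"}}}
HISTOGRAM = {"date_histogram": {
    "field": "records.timestamp",
    "interval": "1s",  # kept from the query above
    "min_doc_count": 1,
    "extended_bounds": {"min": 1504548296273, "max": 1504549196273},
}}

# One (name, definition) pair per step, outermost first.
STEPS = [
    ("nested_A", {"nested": {"path": "records"}}),
    ("bool_aggregation_A", {"filter": {"bool": {"must": [TERM]}}}),
    ("reverse_aggregation", {"reverse_nested": {}}),
    ("bool_aggregation_B", {"filter": {"bool": {"must": [RANGE]}}}),
    ("nested_B", {"nested": {"path": "records"}}),
    ("my_histogram", HISTOGRAM),
]


def wrap(name, definition, child=None):
    """Return {name: definition}, attaching `child` as its sub-aggregations."""
    node = dict(definition)
    if child is not None:
        node["aggregations"] = child
    return {name: node}


def build_body(steps):
    """Nest the steps from the innermost aggregation outwards."""
    child = None
    for name, definition in reversed(steps):
        child = wrap(name, definition, child)
    return {"size": 0, "aggregations": child}


def buckets(response, steps):
    """Walk the response by aggregation name down to the histogram buckets."""
    node = response["aggregations"]
    for name, _ in steps:
        node = node[name]
    return node["buckets"]
```

build_body(STEPS) produces the same structure as the JSON query above. One thing worth checking against your own data: as far as I can tell, nested_B re-enters every record of the parent documents that survive bool_aggregation_B, so the histogram may also count records that do not match the term. Repeating the term filter as a filter aggregation under nested_B is one way to restrict it.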

Eli answered Oct 18 '22 04:10