How to perform a pipeline aggregation without returning all buckets in Elasticsearch

I'm using Elasticsearch 2.3 and I'm trying to perform a two-step computation using a pipeline aggregation. I'm only interested in the final result of the pipeline aggregation, but Elasticsearch returns the information for every bucket.

Since I have a huge number of buckets (tens or hundreds of millions), this is prohibitive. Unfortunately, I cannot find a way to tell Elasticsearch not to return all this information.

Here is a toy example. I have an index test-index with a document type obj. obj has two fields, key and value.

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 100,
  "key": "foo"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 20,
  "key": "foo"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 50,
  "key": "bar"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 60,
  "key": "bar"
}'

curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 70,
  "key": "bar"
}'

I want the average, over all keys, of the minimum value among the objs that share a key: an average of minima.
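To make the target number concrete, here is a minimal sketch in plain Python of the same two-step computation applied to the five toy documents above. It does not talk to Elasticsearch; the names docs and average_of_minima are made up for illustration.

```python
# The five toy documents indexed above, as (key, value) pairs.
docs = [("foo", 100), ("foo", 20), ("bar", 50), ("bar", 60), ("bar", 70)]


def average_of_minima(pairs):
    """Take the minimum value per key, then average those minima."""
    minima = {}
    for key, value in pairs:
        # Keep the smallest value seen so far for this key.
        minima[key] = min(value, minima.get(key, value))
    return sum(minima.values()) / len(minima)


print(average_of_minima(docs))  # 35.0 (foo -> 20, bar -> 50)
```

The first step corresponds to the per-key min aggregation and the second to the avg_bucket pipeline aggregation in the query that follows.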

Elasticsearch allows me to do this:

curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'

But this query returns the minimum for every bucket, although I don't need it:

{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": [

    ]
  },
  "aggregations": {
    "key_aggregates": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2,
          "min_value": {
            "value": 50
          }
        },
        {
          "key": "foo",
          "doc_count": 2,
          "min_value": {
            "value": 20
          }
        }
      ]
    },
    "avg_min_value": {
      "value": 35
    }
  }
}

Is there a way to get rid of all the information inside "buckets": [...]? I'm only interested in avg_min_value.

This might not seem like a problem in this toy example, but when the number of distinct keys is large (tens or hundreds of millions), the query response is prohibitively large, and I would like to prune it.

Is there a way to do this with Elasticsearch? Or am I modelling my data wrong?

NB: it is not acceptable to pre-aggregate my data per key, since the match_all part of my query might be replaced by complex and unknown filters.

NB2: changing size to a non-zero number in my terms aggregation is not acceptable, because it would limit the number of buckets and therefore change the result.

jrjd asked Jun 28 '16 16:06


1 Answer

I had the same issue, and after quite a bit of research I found a solution that I thought I'd share here.

You can use the Response Filtering feature to receive only the part of the response that you want.

You should be able to achieve what you want by adding the query parameter filter_path=aggregations.avg_min_value to the search URL. In the example case, it should look similar to this:

curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search?filter_path=aggregations.avg_min_value' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'
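For anyone calling this from code rather than curl, here is a rough Python sketch using only the standard library. build_search_request and project are hypothetical helper names, and project is only a simplified local stand-in for what filter_path does to the response (single dotted path, no wildcards), not Elasticsearch's own implementation. Nothing is sent over the network here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url, body, filter_path):
    """Build (but do not send) a POST _search request with filter_path set."""
    query = urlencode({"filter_path": filter_path})
    return Request(
        url=f"{base_url}/_search?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def project(response, dotted_path):
    """Keep only the value at dotted_path, wrapped in its enclosing keys."""
    keys = dotted_path.split(".")
    node = response
    for key in keys:
        node = node[key]
    for key in reversed(keys):
        node = {key: node}
    return node


# Trimmed copy of the unfiltered response shown in the question.
sample_response = {
    "took": 21,
    "aggregations": {
        "key_aggregates": {
            "buckets": [
                {"key": "bar", "doc_count": 2, "min_value": {"value": 50}},
                {"key": "foo", "doc_count": 2, "min_value": {"value": 20}},
            ]
        },
        "avg_min_value": {"value": 35},
    },
}

request = build_search_request(
    "http://10.10.0.7:9200/test-index/obj",
    {"size": 0},  # replace with the full aggregation body from above
    "aggregations.avg_min_value",
)
print(request.full_url)
print(project(sample_response, "aggregations.avg_min_value"))
```

The request could then be sent with urllib.request.urlopen(request). With the filter applied, the body that comes back should contain only aggregations.avg_min_value, as the projected dict illustrates. As far as I understand, the filtering happens when the response is written out, so it shrinks what is sent to the client but the buckets are still computed on the cluster.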

PS: if you found another solution would you mind sharing it here? Thanks!

fgal answered Oct 22 '22 18:10