I'm using Elasticsearch 2.3 and I'm trying to perform a two-step computation using a pipeline aggregation. I'm only interested in the final result of my pipeline aggregation but Elasticsearch returns all the buckets information.
Since I have a huge number of buckets (tens or hundreds of millions), this is prohibitively expensive. Unfortunately, I cannot find a way to tell Elasticsearch not to return all this information.
Here is a toy example. I have an index test-index with a document type obj. obj has two fields, key and value.
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 100,
  "key": "foo"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 20,
  "key": "foo"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 50,
  "key": "bar"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 60,
  "key": "bar"
}'
curl -XPOST 'http://10.10.0.7:9200/test-index/obj' -d '{
  "value": 70,
  "key": "bar"
}'
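A side note if you want to reproduce this: newly indexed documents only become visible to search after a refresh, which by default happens within about a second. To force one immediately:
curl -XPOST 'http://10.10.0.7:9200/test-index/_refresh'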
I want to get the average, over all keys, of the minimum value of the objs sharing the same key: an average of minima. Elasticsearch allows me to do this:
curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'
But this query returns the minimum for every bucket, although I don't need it:
{
  "took": 21,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 5,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "key_aggregates": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 3,
          "min_value": {
            "value": 50
          }
        },
        {
          "key": "foo",
          "doc_count": 2,
          "min_value": {
            "value": 20
          }
        }
      ]
    },
    "avg_min_value": {
      "value": 35
    }
  }
}
Is there a way to get rid of all the information inside "buckets": [...]? I'm only interested in avg_min_value.
This might not seem like a problem in this toy example, but when the number of distinct keys is big (tens or hundreds of millions), the query response is prohibitively large, and I would like to prune it.
Is there a way to do this with Elasticsearch? Or am I modelling my data wrong?
NB: it is not acceptable to pre-aggregate my data per key, since the match_all part of my query might be replaced by complex and unknown filters.
NB2: changing size to a positive number in my terms aggregation is not acceptable, because it would truncate the buckets and therefore change the result ("size": 0 means "all buckets" in Elasticsearch 2.x).
I had the same issue, and after doing quite a bit of research I found a solution that I thought I'd share here.
You can use the Response Filtering feature to keep only the parts of the response that you are interested in. You should be able to achieve what you want by adding the query parameter filter_path=aggregations.avg_min_value to the search URL. In the example case, it should look similar to this:
curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search?filter_path=aggregations.avg_min_value' -d '{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "key_aggregates": {
      "terms": {
        "field": "key",
        "size": 0
      },
      "aggs": {
        "min_value": {
          "min": {
            "field": "value"
          }
        }
      }
    },
    "avg_min_value": {
      "avg_bucket": {
        "buckets_path": "key_aggregates>min_value"
      }
    }
  }
}'
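With this filter applied, Elasticsearch strips everything except the matching path when serializing the response. For the toy data above, the reply should look similar to this:
{
  "aggregations": {
    "avg_min_value": {
      "value": 35
    }
  }
}
Note that the terms buckets are still computed internally; response filtering only shrinks the payload that is serialized and sent back, not the work done by the cluster.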
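filter_path also accepts comma-separated lists of paths and wildcards, in case you want to keep a few extra fields, e.g. the query time:
curl -XPOST 'http://10.10.0.7:9200/test-index/obj/_search?filter_path=took,aggregations.avg_min_value' -d '{ ... same body as above ... }'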
PS: if you found another solution, would you mind sharing it here? Thanks!