I want to get the counts of groups which satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.
SELECT COUNT(*) FROM ( SELECT senderResellerId, SUM(requestAmountValue) AS t_amount FROM transactions GROUP BY senderResellerId HAVING t_amount > 10000 ) AS dum;
So far, I can group by senderResellerId with a terms aggregation. But when I add a filter, it does not work as expected.
Elastic Request
{ "aggregations": { "reseller_sale_sum": { "aggs": { "sales": { "aggregations": { "reseller_sale": { "sum": { "field": "requestAmountValue" } } }, "filter": { "range": { "reseller_sale": { "gte": 10000 } } } } }, "terms": { "field": "senderResellerId", "order": { "sales>reseller_sale": "desc" }, "size": 5 } } }, "ext": {}, "query": { "match_all": {} }, "size": 0 }
Actual Response
{ "took" : 21, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "failed" : 0 }, "hits" : { "total" : 150824, "max_score" : 0.0, "hits" : [ ] }, "aggregations" : { "reseller_sale_sum" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 149609, "buckets" : [ { "key" : "RES0000000004", "doc_count" : 8, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000005", "doc_count" : 39, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000006", "doc_count" : 57, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } }, { "key" : "RES0000000007", "doc_count" : 134, "sales" : { "doc_count" : 0, "reseller_sale" : { "value" : 0.0 } } } ] } } }
As you can see from the response above, it returns the resellers, but the reseller_sale aggregation is zero in every bucket.
The HAVING clause is used instead of WHERE with aggregate functions, while the GROUP BY clause groups rows that have the same values into summary rows. WHERE and HAVING can be used in a single query: WHERE filters individual rows before grouping, and HAVING filters the resulting groups. The HAVING clause always comes after the GROUP BY clause and is typically used in combination with it.
GROUP BY means that if different rows have the same value in a given column, those rows are arranged into one group. The GROUP BY clause is used with the SELECT statement in an SQL query.
You may use one of the pipeline aggregations, namely the bucket selector aggregation. The query would look like this:
POST my_index/tdrs/_search { "aggregations": { "reseller_sale_sum": { "aggregations": { "sales": { "sum": { "field": "requestAmountValue" } }, "max_sales": { "bucket_selector": { "buckets_path": { "var1": "sales" }, "script": "params.var1 > 10000" } } }, "terms": { "field": "senderResellerId", "order": { "sales": "desc" }, "size": 5 } } }, "size": 0 }
After putting the following documents in the index:
"hits": [ { "_index": "my_index", "_type": "tdrs", "_id": "AV9Yh5F-dSw48Z0DWDys", "_score": 1, "_source": { "requestAmountValue": 7000, "senderResellerId": "ID_1" } }, { "_index": "my_index", "_type": "tdrs", "_id": "AV9Yh684dSw48Z0DWDyt", "_score": 1, "_source": { "requestAmountValue": 5000, "senderResellerId": "ID_1" } }, { "_index": "my_index", "_type": "tdrs", "_id": "AV9Yh8TBdSw48Z0DWDyu", "_score": 1, "_source": { "requestAmountValue": 1000, "senderResellerId": "ID_2" } } ]
The result of the query is:
"aggregations": { "reseller_sale_sum": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ { "key": "ID_1", "doc_count": 2, "sales": { "value": 12000 } } ] } }
I.e. only those senderResellerId whose cumulative sales are >10000.
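To make the semantics concrete, here is a minimal Python sketch that mimics what the terms, sum and bucket_selector combination computes on the three sample documents. It is only an in-memory illustration, not Elasticsearch code; the function name `having_sum_gt` is made up for this example.

```python
from collections import defaultdict

# The three sample documents from the index above.
docs = [
    {"senderResellerId": "ID_1", "requestAmountValue": 7000},
    {"senderResellerId": "ID_1", "requestAmountValue": 5000},
    {"senderResellerId": "ID_2", "requestAmountValue": 1000},
]


def having_sum_gt(docs, threshold):
    """Group by senderResellerId, sum requestAmountValue, keep sums > threshold."""
    sums = defaultdict(int)
    for doc in docs:
        # Plays the role of the terms aggregation with a nested sum.
        sums[doc["senderResellerId"]] += doc["requestAmountValue"]
    # Plays the role of bucket_selector: drop buckets failing the condition.
    return {key: total for key, total in sums.items() if total > threshold}


selected = having_sum_gt(docs, 10000)
print(selected)  # {'ID_1': 12000}
```

ID_2 sums to 1000 and is dropped, which matches the single ID_1 bucket in the response above.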
To implement an equivalent of SELECT COUNT(*) FROM (... HAVING) one may use a combination of the bucket script aggregation with the sum bucket aggregation. Though there seems to be no direct way to count how many buckets bucket_selector actually selected, we may define a bucket_script that produces 0 or 1 depending on a condition, and a sum_bucket that sums those values:
POST my_index/tdrs/_search { "aggregations": { "reseller_sale_sum": { "aggregations": { "sales": { "sum": { "field": "requestAmountValue" } }, "max_sales": { "bucket_script": { "buckets_path": { "var1": "sales" }, "script": "if (params.var1 > 10000) { 1 } else { 0 }" } } }, "terms": { "field": "senderResellerId", "order": { "sales": "desc" } } }, "max_sales_stats": { "sum_bucket": { "buckets_path": "reseller_sale_sum>max_sales" } } }, "size": 0 }
The output will be:
"aggregations": { "reseller_sale_sum": { "doc_count_error_upper_bound": 0, "sum_other_doc_count": 0, "buckets": [ ... ] }, "max_sales_stats": { "value": 1 } }
The desired bucket count is located in max_sales_stats.value.
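The 0/1 trick can be sketched in plain Python as follows. This is an in-memory imitation of the bucket_script plus sum_bucket idea on the same three sample documents, not a call to Elasticsearch; `count_groups_over` is a hypothetical helper name.

```python
from collections import defaultdict

# The three sample documents used in this answer.
docs = [
    {"senderResellerId": "ID_1", "requestAmountValue": 7000},
    {"senderResellerId": "ID_1", "requestAmountValue": 5000},
    {"senderResellerId": "ID_2", "requestAmountValue": 1000},
]


def count_groups_over(docs, threshold):
    """Count groups whose summed requestAmountValue exceeds threshold."""
    sums = defaultdict(int)
    for doc in docs:
        sums[doc["senderResellerId"]] += doc["requestAmountValue"]
    # bucket_script analogue: one flag (0 or 1) per bucket.
    flags = {key: 1 if total > threshold else 0 for key, total in sums.items()}
    # sum_bucket analogue: add up the flags across all buckets.
    return sum(flags.values())


count = count_groups_over(docs, 10000)
print(count)  # 1
```

Every bucket stays in the output here (flagged 0 or 1), which is why the sum works as a count, whereas bucket_selector would remove the buckets entirely.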
I have to point out 2 things:
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.
This means that the bucket_selector aggregation will be applied after, and on the result of, the terms aggregation on senderResellerId. For example, if there are more senderResellerId values than the size of the terms aggregation allows, you will not get all the ids in the collection with sum(sales) > 10000, but only those that appear in the output of the terms aggregation. Consider using sorting and/or setting a sufficiently large size parameter.
This also applies to the second case, COUNT(*) FROM (... HAVING), which will only count those buckets that are actually present in the output of the terms aggregation.
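The effect of size can be illustrated with a small in-memory sketch. It assumes made-up data (resellers A, B, C) and a simplified model in which buckets are ordered either by document count or by the summed value and then truncated to size before the selector runs. Real Elasticsearch also has per-shard approximations that this sketch ignores.

```python
from collections import Counter, defaultdict


def terms_then_select(docs, size, threshold, order_by_sum):
    """Truncate buckets to `size` first, then apply the HAVING-like filter."""
    sums = defaultdict(int)
    counts = Counter()
    for doc in docs:
        sums[doc["senderResellerId"]] += doc["requestAmountValue"]
        counts[doc["senderResellerId"]] += 1
    # Choose the ordering: by summed value, or by document count.
    ranking = sums if order_by_sum else counts
    top = sorted(sums, key=lambda key: ranking[key], reverse=True)[:size]
    # The selector only ever sees the buckets that survived truncation.
    return [key for key in top if sums[key] > threshold]


# Hypothetical data: A and B have many small sales, C has one large sale.
docs = (
    [{"senderResellerId": "A", "requestAmountValue": 10}] * 5
    + [{"senderResellerId": "B", "requestAmountValue": 20}] * 4
    + [{"senderResellerId": "C", "requestAmountValue": 20000}]
)

by_count = terms_then_select(docs, size=2, threshold=10000, order_by_sum=False)
by_sum = terms_then_select(docs, size=2, threshold=10000, order_by_sum=True)
print(by_count)  # []
print(by_sum)    # ['C']
```

Ordered by document count, reseller C is cut off before the filter runs and the result is empty. Ordered by the sum in descending order, as in the queries above, C survives as long as size is at least the number of qualifying groups.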
In case this query is too heavy or the number of buckets is too big, consider denormalizing your data or storing this sum directly in the document, so you can use a plain range query to achieve your goal.
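As a rough sketch of the denormalized variant, suppose each reseller has one summary document holding a precomputed total. The field name `totalRequestAmount` is an assumption invented for this example and does not appear in the question. The request body for a plain range query could then be built like this:

```python
import json

# Hypothetical field holding the precomputed per-reseller sum.
SUM_FIELD = "totalRequestAmount"


def range_query_body(threshold):
    """Build a request body matching documents whose stored sum exceeds threshold."""
    return {"query": {"range": {SUM_FIELD: {"gt": threshold}}}}


body = range_query_body(10000)
print(json.dumps(body))
# {"query": {"range": {"totalRequestAmount": {"gt": 10000}}}}
```

Sending such a body to the _count endpoint of the summary index would return the number of matching resellers without any aggregation, at the cost of keeping the stored sums up to date.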