SQL like GROUP BY AND HAVING

Tags:

I want to get the counts of groups which satisfy a certain condition. In SQL terms, I want to do the following in Elasticsearch.

SELECT COUNT(*) FROM (    SELECT     senderResellerId,     SUM(requestAmountValue) AS t_amount    FROM     transactions    GROUP BY     senderResellerId    HAVING     t_amount > 10000 ) AS dum;

So far, I could group by senderResellerId by term aggregation. But when I apply filters, it does not work as expected.

Elastic Request

{   "aggregations": {     "reseller_sale_sum": {       "aggs": {         "sales": {           "aggregations": {             "reseller_sale": {               "sum": {                 "field": "requestAmountValue"               }             }           },            "filter": {             "range": {               "reseller_sale": {                  "gte": 10000               }             }           }         }       },        "terms": {         "field": "senderResellerId",          "order": {           "sales>reseller_sale": "desc"         },          "size": 5       }     }   },    "ext": {},    "query": {  "match_all": {} },    "size": 0 }

Actual Response

{   "took" : 21,   "timed_out" : false,   "_shards" : {     "total" : 1,     "successful" : 1,     "failed" : 0   },   "hits" : {     "total" : 150824,     "max_score" : 0.0,     "hits" : [ ]   },   "aggregations" : {     "reseller_sale_sum" : {       "doc_count_error_upper_bound" : -1,       "sum_other_doc_count" : 149609,       "buckets" : [         {           "key" : "RES0000000004",           "doc_count" : 8,           "sales" : {             "doc_count" : 0,             "reseller_sale" : {               "value" : 0.0             }           }         },         {           "key" : "RES0000000005",           "doc_count" : 39,           "sales" : {             "doc_count" : 0,             "reseller_sale" : {               "value" : 0.0             }           }         },         {           "key" : "RES0000000006",           "doc_count" : 57,           "sales" : {             "doc_count" : 0,             "reseller_sale" : {               "value" : 0.0             }           }         },         {           "key" : "RES0000000007",           "doc_count" : 134,           "sales" : {             "doc_count" : 0,             "reseller_sale" : {               "value" : 0.0             }           }         }           }         }       ]     }   } }

As you can see from above response, it is returning resellers but the reseller_sale aggregation is zero in results.

More details are here.

234

asked Oct 24 '17 10:10

damjad

1 Answers

Implementation of HAVING-like behavior

You may use one of the pipeline aggregations, namely bucket selector aggregation. The query would look like this:

POST my_index/tdrs/_search {    "aggregations": {       "reseller_sale_sum": {          "aggregations": {             "sales": {                "sum": {                   "field": "requestAmountValue"                }             },             "max_sales": {                "bucket_selector": {                   "buckets_path": {                      "var1": "sales"                   },                   "script": "params.var1 > 10000"                }             }          },          "terms": {             "field": "senderResellerId",             "order": {                "sales": "desc"             },             "size": 5          }       }    },    "size": 0 }

After putting the following documents in the index:

  "hits": [      {         "_index": "my_index",         "_type": "tdrs",         "_id": "AV9Yh5F-dSw48Z0DWDys",         "_score": 1,         "_source": {            "requestAmountValue": 7000,            "senderResellerId": "ID_1"         }      },      {         "_index": "my_index",         "_type": "tdrs",         "_id": "AV9Yh684dSw48Z0DWDyt",         "_score": 1,         "_source": {            "requestAmountValue": 5000,            "senderResellerId": "ID_1"         }      },      {         "_index": "my_index",         "_type": "tdrs",         "_id": "AV9Yh8TBdSw48Z0DWDyu",         "_score": 1,         "_source": {            "requestAmountValue": 1000,            "senderResellerId": "ID_2"         }      }   ]

The result of the query is:

"aggregations": {       "reseller_sale_sum": {          "doc_count_error_upper_bound": 0,          "sum_other_doc_count": 0,          "buckets": [             {                "key": "ID_1",                "doc_count": 2,                "sales": {                   "value": 12000                }             }          ]       }    }

I.e. only those senderResellerId whose cumulative sales are >10000.

Counting the buckets

To implement an equivalent of SELECT COUNT(*) FROM (... HAVING) one may use a combination of bucket script aggregation with sum bucket aggregation. Though there seems to be no direct way to count how many buckets did bucket_selector actually select, we may define a bucket_script that produces 0 or 1 depending on a condition, and sum_bucket that produces its sum:

POST my_index/tdrs/_search {    "aggregations": {       "reseller_sale_sum": {          "aggregations": {             "sales": {                "sum": {                   "field": "requestAmountValue"                }             },             "max_sales": {                "bucket_script": {                   "buckets_path": {                      "var1": "sales"                   },                   "script": "if (params.var1 > 10000) { 1 } else { 0 }"                }             }          },          "terms": {             "field": "senderResellerId",             "order": {                "sales": "desc"             }          }       },       "max_sales_stats": {          "sum_bucket": {             "buckets_path": "reseller_sale_sum>max_sales"          }       }    },    "size": 0 }

The output will be:

   "aggregations": {       "reseller_sale_sum": {          "doc_count_error_upper_bound": 0,          "sum_other_doc_count": 0,          "buckets": [             ...          ]       },       "max_sales_stats": {          "value": 1       }    }

The desired bucket count is located in max_sales_stats.value.

Important considerations

I have to point out 2 things:

The feature is experimental (as of ES 5.6 it is still experimental, though it was added in 2.0.0-beta1.)
pipeline aggregations are applied on the result of previous aggregations:

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.

This means that bucket_selector aggregation will be applied after and on the result of terms aggregation on senderResellerId. For example, if there are more senderResellerId than size of terms aggregation defines, you will not get all the ids in the collection with sum(sales) > 10000, but only those that appear in the output of terms aggregation. Consider using sorting and/or set sufficient size parameter.

This also applies for the second case, COUNT() (... HAVING), which will only count those buckets that are actually present in the output of aggregation.

In case this query is too heavy or the number of buckets too big, consider denormalizing your data or store this sum directly in the document, so you can use plain range query to achieve your goal.

117

answered Nov 05 '22 12:11

Nikolay Vasiliev

Related questions
                            
                                How can I resolve a relative path to absolute path in golang?
                            
                                What is the use of pkg directory in Go?
                            
                                Difference between Lock.withLock and synchronized in Kotlin
                            
                                How can you call a helm 'helper' template from a subchart with the correct context?
                            
                                How can I find the arn of an api gateway stage?
                            
                                Use pathlib for S3 paths
                            
                                Android TextView: Getting "W/StaticLayout: maxLineHeight should not be -1. maxLines:1 lineCount:1" when setting text
                            
                                Bower replacement in Visual Studio 2017 ASP.NET MVC Core Template
                            
                                How to do client-side UI events in Blazor
                            
                                Line break, new line in KDoc
                            
                                What is more efficient: sorted stream or sorting a list?
                            
                                asyncio create_task to run forever

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With