Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sub aggregation with sum/average with Elasticsearch

I have an index transactions with fields user_id, amount and category. I would like to calculate the average amount per user and category, and then finish off with just getting the total average amount per category. The SQL would look like this:

SELET AVG(average), category from

    (SELECT user_id, category, AVG(amount) AS average FROM transactions WHERE amount < 100000 
    GROUP BY user_id, category) AS a1

GROUP BY category

I'm only getting so far as to having a response of a bucket with all user ids, and then inside that, a bucket with the average amounts per category (for the user). I don't understand how to add yet another aggregation to do what I'm after.

{
  "aggs": {
    "group_by_users": {
      "terms": {
        "field": "user_id.keyword"
      },
      "aggs": {
        "group_by_category": {
          "terms": {
              "field": "category.keyword"
          },
          "aggs": {
            "average_amount": {
              "avg": {
                "field": "amount"
              }
            }
          }
        }
      }
    }
  }
}

Any help is very much appreciated.

Edit: Example requested so here is first some sample data and then the intermediate result which will end with the wanted result at the bottom.

-----------------------------------------
|  user_id  |   category   |   amount   |
-----------------------------------------
|     1     |   insurances |   1000     |
|     1     |   transport  |     50     |
|     1     |   transport  |    100     |
|     2     |   insurances |    700     |
|     2     |   insurances |    200     |
|     2     |   transport  |    300     |
-----------------------------------------

Calculation for user 1 transport: (50+100)/2

So, the first thing that needs to happen is a group by user_id and category to get the average per user and category.

That would yield:

-----------------------------------------
|  user_id  |   category   |   average  |
-----------------------------------------
|     1     |   insurances |   1000     |
|     1     |   transport  |     75     |
|     2     |   insurances |    450     |
|     2     |   transport  |    300     |
-----------------------------------------

It's important to understand I can't do an average for all users together, I need the average spend per user, per category, first.

So now I just want to group by category and calculate the average amount:

-----------------------------
|   category   |   average  |
-----------------------------
|   insurances |   725      |
|   transport  |   187,5    |
-----------------------------

Example for insurances: (1000 + 450) / 2

like image 223
Andreas Avatar asked Sep 17 '25 06:09

Andreas


1 Answers

avg_bucket pipeline aggregation will do the job in some cases (but might not scale well with the size of the index, see notes below):

POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by category": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "by user_id": {
          "terms": {
            "field": "user_id"
          },
          "aggs": {
            "avg by user": {
              "avg": {
                "field": "amount"
              }
            }
          }
        },
        "average by user, category": {
          "avg_bucket": {
            "buckets_path": "by user_id>avg by user"
          }
        }
      }
    }
  }
}

This will response that looks something like this:

{
  ...
  "aggregations" : {
    "by category" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "insurances",
          "doc_count" : 3,
          "by user_id" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 2,
                "doc_count" : 2,
                "avg by user" : {
                  "value" : 450.0
                }
              },
              {
                "key" : 1,
                "doc_count" : 1,
                "avg by user" : {
                  "value" : 1000.0
                }
              }
            ]
          },
          "average by user, category" : {
            "value" : 725.0   <--- average for `insurances`
          }
        },
        {
          "key" : "transport",
          "doc_count" : 3,
          "by user_id" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : 1,
                "doc_count" : 2,
                "avg by user" : {
                  "value" : 75.0
                }
              },
              {
                "key" : 2,
                "doc_count" : 1,
                "avg by user" : {
                  "value" : 300.0
                }
              }
            ]
          },
          "average by user, category" : {
            "value" : 187.5      <--- average for `transport`
          }
        }
      ]
    }
  }
}

How does it work?

Let's start with "by user_id" terms aggregation: we ask Elasticsearch to group documents by user_id and to compute average on amount using avg aggregation.

POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by user_id": {
      "terms": {
        "field": "user_id"
      },
      "aggs": {
        "avg by user": {
          "avg": {
            "field": "amount"
          }
        }
      }
    }
  }
}

This is equivalent to SQL:

SELECT user_id, avg(amount)
FROM my_index
GROUP BY user_id;

So far this is pretty straightforward. But how can we average also on the category now?

We can add another terms aggregation "by category" on top of "by user id" one. Now avg will take into account also the category:

POST myindex1/_search
{
  "size": 0,
  "aggs": {
    "by category": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "by user_id": {
          "terms": {
            "field": "user_id"
          },
          "aggs": {
            "avg by user": {
              "avg": {
                "field": "amount"
              }
            }
          }
        }
      }
    }
  }
}

This is equivalent to SQL:

SELECT user_id, category, avg(amount)
FROM my_index
GROUP BY user_id, category;

Can we now use the result from the previous query, and aggregate on category again?

This can be done with avg_bucket pipeline aggregation. The only thing missing is to tell avg_bucket aggregation where exactly to find the buckets to aggregate upon, which is done via a buckets_path expression.

This is how we arrive to the query that I posted in the top, and it will effectively do the equivalent of SQL you posted in the question.

But...

What can go wrong?

The downside of this method is that it does not scale well with the number of documents in the index.

The fact is, a pipeline aggregation only operates with already aggregated data:

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree.

In our case this means that if there are more than 10 distinct user_id in the index, our averages will not be precise.

This happens because by default terms aggregation only returns top 10 buckets, and the SQL-ish equivalent should be like the following:

SELECT category, avg(avg_amount)
FROM (
    SELECT user_id, category, avg(amount) avg_amount
    FROM my_index
    GROUP BY user_id, category
    LIMIT 10 per user_id
) Q
LIMIT 10;

This limit can be changed via size parameter of terms aggregation.

Another thing to keep in mind is that terms returns approximate document counts, which can affect the averages as well.


Hope that helps!

like image 135
Nikolay Vasiliev Avatar answered Sep 19 '25 07:09

Nikolay Vasiliev