
SQL-style GROUP BY aggregate functions in jq (COUNT, SUM, etc.)

Similar questions asked here before:

Count items for a single key: jq count the number of items in json by a specific key

Calculate the sum of object values: How do I sum the values in an array of maps in jq?

Question

How can the COUNT aggregate function be emulated in jq so that it behaves like its SQL original? Let's extend the question to cover other common SQL aggregate functions:

  • COUNT
  • SUM / MAX / MIN / AVG
  • ARRAY_AGG

The last one is not a standard SQL function - it's from PostgreSQL but is quite useful.

The input is a stream of valid JSON objects. For demonstration, let's use a simple model of owners and their pets.

Model and data

Base relation: Owner

id name  age
 1 Adams  25
 2 Baker  55
 3 Clark  40
 4 Davis  31

Base relation: Pet

id name  litter owner_id
10 Bella      4        1
20 Lucy       2        1
30 Daisy      3        2
40 Molly      4        3
50 Lola       2        4
60 Sadie      4        4
70 Luna       3        4

Source

From the above we get a derived relation, Owner_Pet (the result of an SQL JOIN of the two base relations), presented in JSON format for our jq queries (the source data):

{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 10, "pet": "Bella", "litter": 4 }
{ "owner_id": 1, "owner": "Adams", "age": 25, "pet_id": 20, "pet": "Lucy",  "litter": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pet_id": 30, "pet": "Daisy", "litter": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pet_id": 40, "pet": "Molly", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 50, "pet": "Lola",  "litter": 2 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 60, "pet": "Sadie", "litter": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pet_id": 70, "pet": "Luna",  "litter": 3 }

Requests

Here are sample requests and their expected output:

  • COUNT the number of pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets_count": 2 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets_count": 1 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets_count": 1 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets_count": 3 }
  • SUM up the number of whelps per owner and get their MAX (MIN/AVG):
{ "owner_id": 1, "owner": "Adams", "age": 25, "litter_total": 6, "litter_max": 4 }
{ "owner_id": 2, "owner": "Baker", "age": 55, "litter_total": 3, "litter_max": 3 }
{ "owner_id": 3, "owner": "Clark", "age": 40, "litter_total": 4, "litter_max": 4 }
{ "owner_id": 4, "owner": "Davis", "age": 31, "litter_total": 9, "litter_max": 4 }
  • ARRAY_AGG pets per owner:
{ "owner_id": 1, "owner": "Adams", "age": 25, "pets": [ "Bella", "Lucy" ] }
{ "owner_id": 2, "owner": "Baker", "age": 55, "pets": [ "Daisy" ] }
{ "owner_id": 3, "owner": "Clark", "age": 40, "pets": [ "Molly" ] }
{ "owner_id": 4, "owner": "Davis", "age": 31, "pets": [ "Lola", "Sadie", "Luna" ] }
Onkeltem asked Jan 18 '18 12:01




1 Answer

Here's an alternative that uses only basic jq, without any custom functions. (I took the liberty of dropping the redundant parts of the question.)

Count

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, count: map(.pet) | length})'
Out> [{"owner_id": 1, "count": 2}, ...]

Sum

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, sum: map(.litter) | add})'
Out> [{"owner_id": 1, "sum": 6}, ...]

Max

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, max: map(.litter) | max})'
Out> [{"owner_id": 1, "max": 4}, ...]

Aggregate

In> jq -s 'group_by(.owner_id) | map({owner_id: .[0].owner_id, agg: map(.pet) })'
Out> [{"owner_id": 1, "agg": ["Bella", "Lucy"]}, ...]
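
The question also asks for MIN and AVG, which follow the same pattern. The following is a minimal sketch rather than part of the original answer: the shell variable names (data, min_avg) and the -c flag for compact output are additions for illustration, and the rows are trimmed to the fields this query uses.

```shell
# Owner_Pet rows from the question, reduced to the fields this query needs.
data='{"owner_id":1,"pet":"Bella","litter":4}
{"owner_id":1,"pet":"Lucy","litter":2}
{"owner_id":2,"pet":"Daisy","litter":3}
{"owner_id":3,"pet":"Molly","litter":4}
{"owner_id":4,"pet":"Lola","litter":2}
{"owner_id":4,"pet":"Sadie","litter":4}
{"owner_id":4,"pet":"Luna","litter":3}'

# MIN and AVG per owner: avg is the group's sum divided by its row count.
min_avg=$(printf '%s\n' "$data" | jq -s -c '
  group_by(.owner_id)
  | map({
      owner_id: .[0].owner_id,
      min: (map(.litter) | min),
      avg: (map(.litter) | add / length)
    })')
echo "$min_avg"
```

One difference from SQL worth keeping in mind: SQL's AVG ignores NULLs, whereas here length counts every row in the group, so null litter values would need to be filtered out first.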

Sure, these might not be the most efficient implementations, but they show nicely how to build such aggregates yourself. All that changes between them is the expression inside the last map and the function after the pipe | (length, add, max).
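
Because only the per-key expression changes, several aggregates can share one object. As a rough sketch (not from the original answer), the query below carries owner and age along from the first row of each group and emits one object per owner, close to the output format the question asks for. The variable names (data, combined) and the -c flag are illustrative additions.

```shell
# Full Owner_Pet rows from the question, one JSON object per line.
data='{"owner_id":1,"owner":"Adams","age":25,"pet_id":10,"pet":"Bella","litter":4}
{"owner_id":1,"owner":"Adams","age":25,"pet_id":20,"pet":"Lucy","litter":2}
{"owner_id":2,"owner":"Baker","age":55,"pet_id":30,"pet":"Daisy","litter":3}
{"owner_id":3,"owner":"Clark","age":40,"pet_id":40,"pet":"Molly","litter":4}
{"owner_id":4,"owner":"Davis","age":31,"pet_id":50,"pet":"Lola","litter":2}
{"owner_id":4,"owner":"Davis","age":31,"pet_id":60,"pet":"Sadie","litter":4}
{"owner_id":4,"owner":"Davis","age":31,"pet_id":70,"pet":"Luna","litter":3}'

# group_by(...)[] turns the array of groups into a stream, so each owner
# becomes its own output object instead of an element of one big array.
combined=$(printf '%s\n' "$data" | jq -s -c '
  group_by(.owner_id)[]
  | {
      owner_id: .[0].owner_id,
      owner: .[0].owner,
      age: .[0].age,
      pets_count: length,
      litter_total: (map(.litter) | add),
      litter_max: (map(.litter) | max),
      pets: map(.pet)
    }')
echo "$combined"
```

Taking owner and age from .[0] relies on those fields being identical within a group, which holds here because they depend only on owner_id.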

The outer map iterates over the groups, taking owner_id from the first item of each group, and the inner map iterates over the items within that group. Not as pretty as SQL, but not much more complicated.
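
To make the two levels visible, it can help to look at the intermediate values. This is a small illustrative sketch (the variable names data, groups and first_ids are made up for the example, and the rows are trimmed to two fields):

```shell
# Owner_Pet rows from the question, reduced to owner_id and pet.
data='{"owner_id":1,"pet":"Bella"}
{"owner_id":1,"pet":"Lucy"}
{"owner_id":2,"pet":"Daisy"}
{"owner_id":3,"pet":"Molly"}
{"owner_id":4,"pet":"Lola"}
{"owner_id":4,"pet":"Sadie"}
{"owner_id":4,"pet":"Luna"}'

# group_by yields an array of arrays, one inner array per owner_id.
# The outer map visits each group; the inner map visits each row in it.
groups=$(printf '%s\n' "$data" | jq -s -c 'group_by(.owner_id) | map(map(.pet))')
echo "$groups"

# .[0] inside the outer map is the first row of the current group.
first_ids=$(printf '%s\n' "$data" | jq -s -c 'group_by(.owner_id) | map(.[0].owner_id)')
echo "$first_ids"
```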

I learned jq today and already managed to do this, so this should be encouraging for anyone getting started. jq is neither like sed nor like SQL, but it's not terribly hard either.

Cornelius Roemer answered Sep 30 '22 03:09