Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Clickhouse running diff with grouping

Tags:

clickhouse

General Task

A table consists of three columns (time, key, value). The task is to calculate a running difference for each key. So, from input

---------------
| time | key | value |
---------------
| 1    | A   | 4     |
| 2    | B   | 1     |
| 3    | A   | 6     |
| 4    | A   | 7     |
| 5    | B   | 3     |
| 6    | B   | 7     |

it is desired to get

----------------------
| key | value | delta |
----------------------
| A   | 4     |  0    |
| B   | 1     |  0    |
| A   | 6     |  2    |
| A   | 7     |  1    |
| B   | 3     |  2    |
| B   | 7     |  4    |

Approaches

  1. runningDifference function. Works, if the key is fixed. So we can

    select *, runningDifference(value) from
        (SELECT key, value from table where key = 'A' order by time)
    

    Note that subquery is necessary here. This solution suffers when you want to get this for different keys

  2. groupArray.

    select key, groupArray(value) from
        (SELECT key, value from table order by time)
        group by key
    

    So, now we get a key and an array of elements with this key. Good.

    But how to calculate a sliding difference? If we could do that, then ARRAY JOIN would lead us to a result.

    Or we can even zip the array with itself and then apply lambda (we have arrayMap for that) but... we don't have any zip alternative.

Any ideas? Thanks in advance.

like image 604
Konstantin Sekeresh Avatar asked Jan 28 '23 14:01

Konstantin Sekeresh


1 Answers

Solution with arrays:

WITH 
   groupArray(value) as time_sorted_vals,
   arrayEnumerate(time_sorted_vals) as indexes,
   arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT 
   key,
   running_diffs
FROM 
    (SELECT key, value from table order by time)
GROUP by key

Other option (doing sort inside each group separately, which is more optimal in a lot of cases)

WITH
   groupArray( tuple(value,time) ) as val_time_tuples,
   arraySort( x -> x.2, val_time_tuples ) as val_time_tuples_sorted,
   arrayMap( t -> t.1, indexes) as time_sorted_vals,
   arrayEnumerate(time_sorted_vals) as indexes,
   arrayMap( i -> time_sorted_vals[i] - time_sorted_vals[i-1], indexes) as running_diffs
SELECT 
   key,
   running_diffs
FROM 
   time
GROUP by key

and you can apply ARRAY JOIN on the result afterward.

like image 157
filimonov Avatar answered Jan 30 '23 05:01

filimonov