Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Window functions: PARTITION BY one column after ORDER BY another

Disclaimer: The shown problem is much more general than I expected first. The example below is taken from a solution to another question. But now I was taking this sample for solving many problems more - mostly related to time series (have a look at the "Linked" section in the right bar).

So I am trying to explain the problem more generally first:


I am using PostgreSQL but I am sure this problem exists in other window function supporting DBMS' (MS SQL Server, Oracle, ...) as well.


Window functions can be used to group certain values together by a common attribute or value. For example you can group rows by a date. Then you are able to calculate the max value within every single date or an average value or counting rows or whatever.

This can be achieved by defining a PARTITION. Grouping by dates would work with PARTITION BY date_column. Now you want to do an operation which needs a special order within your groups (calculating row numbers or sum up a column). This can be done with PARTITON BY date_column ORDER BY an_attribute_column.

Now think about a finer resolution of time series. What if you do not have dates but timestamps. Then you cannot group by the time column anymore. But nevertheless it might be important to analyse the data in the order they were added (maybe the timestamp is the creating time of your data set). Then you realize that some consecutive rows have the same value and you want to group your data by this common value. But the clue is that the rows have different timestamps.

The problem here is that you cannot do a PARTITION BY value_column. Because PARTITION BY forces an ordering first. So your table would be ordered by the value_column before the grouping and is not ordered by the timestamp anymore. This yields in results you are not expecting.

More general speaking: The problem is to ensure a special ordering even if the ordered column is not part of the created partition.


Example:

db<>fiddle

I have the following table:

ts      val
100000  50
130100  30050
160100  60050 
190200  100
220200  30100 
250200  30100 
300000  300
500000  100
550000  1000  
600000  1000
650000  2000  
700000  2000
720000  2000
750000  300

I had the problem that I had to group all tied values of the column val. But I wanted to hold the order by ts. To achieve this I wanted to add a column with a unique ID per val group

Expected result:

ts      val     group
100000  50      1
130100  30050   2
160100  60050   3
190200  100     4
220200  30100   5     \ same group
250200  30100   5     /
300000  300     6
500000  100     7
550000  1000    8     \ same group
600000  1000    8     /
650000  2000    9     \
700000  2000    9     | same group
720000  2000    9     /
750000  300     10

First try was the use of the rank window function which would do this job normally:

SELECT 
    *,
    rank() OVER (PARTITION BY val ORDER BY ts)
FROM 
    test

But in this case this doesn't work because the PARTITION BY clause orders the table first by its partition columns (val in this case) and then by its ORDER BY columns. So the order is by val, ts instead of the expected order by ts. So the result was not the expected one of course.

ts       val     rank
100000   50      1
190200   100     1
500000   100     2
300000   300     1
750000   300     2
550000   1000    1
600000   1000    2
650000   2000    1
700000   2000    2
720000   2000    3
130100   30050   1
220200   30100   1
250200   30100   2
160100   60050   1

The question is: How to get the group ids with respect to the order by ts?


Edit: I added an own solution below but I feel very uncomfortable with it. It seems way too complicated. I was wondering if there's a better way to achieve this result.

like image 946
S-Man Avatar asked Sep 13 '18 18:09

S-Man


People also ask

Does order matter in partition by?

The order of the integers in the sum "does not matter": that is, two expressions that contain the same integers in a different order are considered to be the same partition. The number of different partitions of n is denoted p ( n ) p(n) p(n).

What is happening when you omit the ORDER BY clause when doing aggregates with window functions?

If the window ORDER BY clause is omitted, then rows are processed in an unspecified order so that the results of any window function invoked in this way would be unpredictable and therefore meaningless. Aggregation functions invoked in this way might be sensitive to what the window ORDER BY clause says.

How does ORDER BY work in window function?

ORDER BY order_list The ORDER BY clause defines the logical order of the rows within each partition of the result set. If no PARTITION BY is specified, ORDER BY uses the entire table. ORDER BY is optional for the aggregate window functions and required for the ranking functions.

What is the difference between ORDER BY and partition by?

Partition By: This divides the rows or query result set into small partitions. Order By: This arranges the rows in ascending or descending order for the partition window. The default order is ascending. Row or Range: You can further limit the rows in a partition by specifying the start and endpoints.


1 Answers

I came up with this solution by myself (hoping someone else will get a better one):

demo:db<>fiddle

  1. order by ts
  2. give out the next val value with the lag window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)
  3. check if the next and the current values are the same. Then I can print out a 0 or a 1
  4. sum up these values with an ordered SUM. This generates the groups I am looking for. They group the val column but ensure the ordering by the ts column.

The query:

SELECT 
    *, 
    SUM(is_diff) OVER (ORDER BY ts) 
FROM (
    SELECT 
        *,
        CASE WHEN val = lag(val) over (order by ts) THEN 0 ELSE 1 END as is_diff
    FROM test 
)s

The result:

ts       val     is_diff   sum
100000   50      1         1
130100   30050   1         2
160100   60050   1         3
190200   100     1         4
220200   30100   1         5    \ group
250200   30100   0         5    /
300000   300     1         6
500000   100     1         7
550000   1000    1         8    \ group
600000   1000    0         8    /
650000   2000    1         9    \
700000   2000    0         9    | group
720000   2000    0         9    /
750000   300     1         10
like image 143
S-Man Avatar answered Sep 21 '22 12:09

S-Man