Disclaimer: The shown problem is much more general than I expected first. The example below is taken from a solution to another question. But now I was taking this sample for solving many problems more - mostly related to time series (have a look at the "Linked" section in the right bar).
So I am trying to explain the problem more generally first:
I am using PostgreSQL but I am sure this problem exists in other window function supporting DBMS' (MS SQL Server, Oracle, ...) as well.
Window functions can be used to group certain values together by a common attribute or value. For example you can group rows by a date. Then you are able to calculate the max value within every single date or an average value or counting rows or whatever.
This can be achieved by defining a PARTITION
. Grouping by dates would work with PARTITION BY date_column
. Now you want to do an operation which needs a special order within your groups (calculating row numbers or sum up a column). This can be done with PARTITON BY date_column ORDER BY an_attribute_column
.
Now think about a finer resolution of time series. What if you do not have dates but timestamps. Then you cannot group by the time column anymore. But nevertheless it might be important to analyse the data in the order they were added (maybe the timestamp is the creating time of your data set). Then you realize that some consecutive rows have the same value and you want to group your data by this common value. But the clue is that the rows have different timestamps.
The problem here is that you cannot do a PARTITION BY value_column
. Because PARTITION BY
forces an ordering first. So your table would be ordered by the value_column
before the grouping and is not ordered by the timestamp anymore. This yields in results you are not expecting.
More general speaking: The problem is to ensure a special ordering even if the ordered column is not part of the created partition.
Example:
db<>fiddle
I have the following table:
ts val
100000 50
130100 30050
160100 60050
190200 100
220200 30100
250200 30100
300000 300
500000 100
550000 1000
600000 1000
650000 2000
700000 2000
720000 2000
750000 300
I had the problem that I had to group all tied values of the column val
. But I wanted to hold the order by ts
. To achieve this I wanted to add a column with a unique ID per val
group
Expected result:
ts val group
100000 50 1
130100 30050 2
160100 60050 3
190200 100 4
220200 30100 5 \ same group
250200 30100 5 /
300000 300 6
500000 100 7
550000 1000 8 \ same group
600000 1000 8 /
650000 2000 9 \
700000 2000 9 | same group
720000 2000 9 /
750000 300 10
First try was the use of the rank
window function which would do this job normally:
SELECT
*,
rank() OVER (PARTITION BY val ORDER BY ts)
FROM
test
But in this case this doesn't work because the PARTITION BY
clause orders the table first by its partition columns (val
in this case) and then by its ORDER BY
columns. So the order is by val, ts
instead of the expected order by ts
. So the result was not the expected one of course.
ts val rank
100000 50 1
190200 100 1
500000 100 2
300000 300 1
750000 300 2
550000 1000 1
600000 1000 2
650000 2000 1
700000 2000 2
720000 2000 3
130100 30050 1
220200 30100 1
250200 30100 2
160100 60050 1
The question is: How to get the group ids with respect to the order by ts
?
Edit: I added an own solution below but I feel very uncomfortable with it. It seems way too complicated. I was wondering if there's a better way to achieve this result.
The order of the integers in the sum "does not matter": that is, two expressions that contain the same integers in a different order are considered to be the same partition. The number of different partitions of n is denoted p ( n ) p(n) p(n).
If the window ORDER BY clause is omitted, then rows are processed in an unspecified order so that the results of any window function invoked in this way would be unpredictable and therefore meaningless. Aggregation functions invoked in this way might be sensitive to what the window ORDER BY clause says.
ORDER BY order_list The ORDER BY clause defines the logical order of the rows within each partition of the result set. If no PARTITION BY is specified, ORDER BY uses the entire table. ORDER BY is optional for the aggregate window functions and required for the ranking functions.
Partition By: This divides the rows or query result set into small partitions. Order By: This arranges the rows in ascending or descending order for the partition window. The default order is ascending. Row or Range: You can further limit the rows in a partition by specifying the start and endpoints.
I came up with this solution by myself (hoping someone else will get a better one):
demo:db<>fiddle
ts
val
value with the lag
window function (https://www.postgresql.org/docs/current/static/tutorial-window.html)0
or a 1
SUM
. This generates the groups I am looking for. They group the val
column but ensure the ordering by the ts
column. The query:
SELECT
*,
SUM(is_diff) OVER (ORDER BY ts)
FROM (
SELECT
*,
CASE WHEN val = lag(val) over (order by ts) THEN 0 ELSE 1 END as is_diff
FROM test
)s
The result:
ts val is_diff sum
100000 50 1 1
130100 30050 1 2
160100 60050 1 3
190200 100 1 4
220200 30100 1 5 \ group
250200 30100 0 5 /
300000 300 1 6
500000 100 1 7
550000 1000 1 8 \ group
600000 1000 0 8 /
650000 2000 1 9 \
700000 2000 0 9 | group
720000 2000 0 9 /
750000 300 1 10
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With