Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the role of ORDER BY in the PARTITION BY function?

I have a table with data follow,

    ID        SEQ    EFFDAT                 
------- ---------    -----------------------
  1024          1    01/07/2010 12:00:00 AM   
  1024          3    18/04/2017 12:00:00 AM   
  1024          2    01/08/2017 12:00:00 AM   

When I execute the following query, I am getting wrong maximum sequence still I am getting the correct maximum effective date.

Query:

SELECT 
max(seq) over (partition by id order by EFFDAT desc) maxEffSeq,
partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;

Output:

 MAXEFFSEQ         ID        SEQ EFFDAT                   MAXEFFDAT              
---------- ---------- ---------- ------------------------ ------------------------
         2       1024          2 01/08/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          3 18/04/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          1 01/07/2010 12:00:00 AM   01/08/2017 12:00:00 AM 

If I remove the order by in my query, I am getting the correct output.

Query:

SELECT max(seq) over (partition by id ) maxEffSeq, partitionByTest.*,
max(EFFDAT) over (partition by (id) order by EFFDAT desc ) maxeffdat
FROM partitionByTest;

Output:

 MAXEFFSEQ         ID        SEQ EFFDAT                   MAXEFFDAT              
---------- ---------- ---------- ------------------------ ------------------------
         3       1024          2 01/08/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          3 18/04/2017 12:00:00 AM   01/08/2017 12:00:00 AM   
         3       1024          1 01/07/2010 12:00:00 AM   01/08/2017 12:00:00 AM   

I know that when we are using MAX function, it is not required to use order by clause. But I am interested to know how order by works in partition by function and why it is giving the wrong result for sequence and correct result for date when I use order by clause ?

like image 796
Karthik Avatar asked Oct 24 '25 00:10

Karthik


1 Answers

Adding an order by also implies a windowing clause, and as you have't specified one you get the default, so you're really doing:

max(seq) over (
  partition by id
  order by EFFDAT desc
  range between unbounded preceding and current row
)

If you think about how the data looks if you order it in the same way, by descending date:

select partitionbytest.*,
  count(*) over (partition by id order by effdat desc) range_rows,
  max(seq) over (partition by id order by effdat desc) range_max_seq,
  count(*) over (partition by id) id_rows,
  max(seq) over (partition by id) id_max_seq
from partitionbytest
order by effdat desc;

        ID        SEQ EFFDAT     RANGE_ROWS RANGE_MAX_SEQ    ID_ROWS ID_MAX_SEQ
---------- ---------- ---------- ---------- ------------- ---------- ----------
      1024          2 2017-08-01          1             2          3          3
      1024          3 2017-04-18          2             3          3          3
      1024          1 2010-07-01          3             3          3          3

then it becomes a bit clearer. I've included equivalent analytic counts so you can also see how many rows are being considered, with and without the order by clause.

  • For the first row the max seq value is found from looking at that current row's data and all preceding rows with later dates (as it's descending), and there are none of those, so it is the value from that row itself - so it's 2. The rows following it it, with seq values 3 and 1, are not considered.

  • For the second row it looks at the current row and all preceding rows with later dates, so it can consider both the preceding value of 2 and the current value of 3. Since 3 is highest among those, it shows that. The row following it it, with seq value 1, is not considered.

  • For the third row it looks at the current row and all preceding rows with later dates, so it can consider the preceding values of 2 and 3 and the current value of 1. Since 3 is still highest it shows that again.

Without the order by clause it always considers all values for that ID, so it sees 3 as the highest for all of them.

See the documentation for analytic functions for more details of how this is determined, partitularly:

The group of rows is called a window and is defined by the analytic_clause. For each row, a sliding window of rows is defined. The window determines the range of rows used to perform the calculations for the current row. Window sizes can be based on either a physical number of rows or a logical interval such as time.

and

You cannot specify [windowing_clause] unless you have specified the order_by_clause.

and

If you omit the windowing_clause entirely, then the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

like image 178
Alex Poole Avatar answered Oct 26 '25 16:10

Alex Poole



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!