Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Difference between "ROWS BETWEEN" and "RANGE BETWEEN" in (Presto) window function "OVER" clause

This question is primarily about older versions of PrestoSQL, which have been resolved in the (now renamed) Trino project as of versions 346. However, Amazon's Athena project is based off of Presto versions 0.217 (Athena Engine 2) and 0.172 (Athena Engine 1), which does have the issues described below. This question was written specifically around Athena Engine 1 / PrestoSQL version 0.172

Questions (tl;dr)

  1. What is the difference between ROWS BETWEEN and RANGE BETWEEN in Presto window Functions?
    • Are these just synonyms for each other, or are there core conceptual differences?
    • If they are just synonyms, why does ROWS BETWEEN allow more options than RANGE BETWEEN?
  2. Is there a query scenario where it's possible to use the exact same parameters on ROWS BETWEEN and RANGE BETWEEN and get different results?
    • If using just unbounded/current row, is there a scenario where you'd use RANGE instead of ROWS (or vice-versa)?
  3. Since ROWS has more options, why isn't it mentioned at all in the documentation? o_O

Comments

The presto documentation is fairly quiet about even RANGE, and doesn't mention ROWS. I haven't found many discussions or examples around window functions in Presto. I'm starting to set through the Presto code-base to try to figure this out. Hopefully someone can save me from that, and we can improve the documentation together.

The Presto code has a parser and test cases for the ROWS variant, but there's no mention in the documentation of ROWS.

The test cases I found with both ROWS and RANGE don't test anything different between the two syntaxes.

They almost look like synonyms, but they do behave differently in my testing, and have different allowed parameters and validation rules.

The following examples can be run with the starburstdata/presto Docker image running Presto 0.213-e-0.1. Typically I run Presto 0.172 through Amazon Athena, and have almost always ended up using ROWS.

RANGE

RANGE seems to be limited to "UNBOUNDED" and "CURRENT ROW". The following returns an error:

range between 1 preceding and 1 following

use tpch.tiny;

select custkey, orderdate,
       array_agg(orderdate) over ( 
           partition by custkey 
           order by orderdate asc 
           range between 1 preceding and 1 following
       ) previous_orders 
from orders where custkey in (419, 320) and orderdate < date('1996-01-01')
order by custkey, orderdate asc;

ERROR: Window frame RANGE PRECEDING is only supported with UNBOUNDED

The following range syntaxes do work fine (with expected differing results). All following examples based on the above query, just changing the range

range between unbounded preceding and current row

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30]
     320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29]
     419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

range between current row and unbounded following

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1992-07-30 | [1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-07-08 | [1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-08-04 | [1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-09-18 | [1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1993-12-29 | [1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1995-01-30]

range between unbounded preceding and unbounded following

 custkey | orderdate  |                             previous_orders
---------+------------+--------------------------------------------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]

ROWS

The three working examples for RANGE above all work for ROWS and produce identical output.

rows between unbounded preceding and current row
rows between current row and unbounded following
rows between unbounded preceding and unbounded following

output omitted - identical to above

However, ROWS allows for far more control, since you can also do the syntax above that fails with range:

rows between 1 preceding and 1 following

 custkey | orderdate  |           previous_orders
---------+------------+--------------------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30]
     320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-07-08 | [1992-07-30, 1994-07-08, 1994-08-04]
     320 | 1994-08-04 | [1994-07-08, 1994-08-04, 1994-09-18]
     320 | 1994-09-18 | [1994-08-04, 1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-09-18, 1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29]
     419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1993-12-29, 1995-01-30]

rows between current row and 1 following

 custkey | orderdate  |     previous_orders
---------+------------+--------------------------
     320 | 1992-07-10 | [1992-07-10, 1992-07-30]
     320 | 1992-07-30 | [1992-07-30, 1994-07-08]
     320 | 1994-07-08 | [1994-07-08, 1994-08-04]
     320 | 1994-08-04 | [1994-08-04, 1994-09-18]
     320 | 1994-09-18 | [1994-09-18, 1994-10-12]
     320 | 1994-10-12 | [1994-10-12]
     419 | 1992-03-16 | [1992-03-16, 1993-12-29]
     419 | 1993-12-29 | [1993-12-29, 1995-01-30]
     419 | 1995-01-30 | [1995-01-30]

rows between 5 preceding and 2 preceding

 custkey | orderdate  |                 previous_orders
---------+------------+--------------------------------------------------
     320 | 1992-07-10 | NULL
     320 | 1992-07-30 | NULL
     320 | 1994-07-08 | [1992-07-10]
     320 | 1994-08-04 | [1992-07-10, 1992-07-30]
     320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08]
     320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
     419 | 1992-03-16 | NULL
     419 | 1993-12-29 | NULL
     419 | 1995-01-30 | [1992-03-16]
like image 652
mbafford Avatar asked Feb 19 '20 14:02

mbafford


1 Answers

  • ROWS are literally number of rows before and after that you want to aggregate. So ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING will end up with 3 rows: the curnet row 1 row before and 1 row after, regardless of the value of orderdate.
  • RANGE will look at the values of orderdate and will decide what should be aggregated and what not. So ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING would theoretically take all lines with values of orderdate-1, orderdate and orderdate+1 - this can be more than 3 lines (see more explanations here)

In Presto the ROWS is fully implemented, but the RANGE is somehow only partially implemented, and you can only use in with CURRENT ROW and UNBOUNDED.

NOTE: Recent versions of Trino (formerly known as Presto SQL) have full support for RANGE and GROUPS framing. See this blog post for an explanation of how they work.

The best way in Presto, to be able to see the diff between the two, is to make sure you have same values of the order clause:

WITH
   tt1  (custkey, orderdate, product) AS 
      ( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
                               ('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) )  )

SELECT *, 
       array_agg(product) OVER (partition by custkey) c, 
       array_agg(product) OVER (partition by custkey order by orderdate) c_order,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,
       
       array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,
       
       -- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING)  range22,
          -- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
       array_agg(product) OVER (partition by custkey order by orderdate ROWS  BETWEEN 2 PRECEDING AND 2 FOLLOWING)  rows22

from tt1
order by custkey, orderdate, product

You can run, and see full results, and learn from them..

I'll put here only some interesting columns:

custkey   orderdate     product    range_ubc           rows_ubc
a         10/07/1992    3          [3]                 [3]
a         10/08/1993    4          [3, 4]              [3, 4]
a         13/07/1994    5          [3, 4, 5]           [3, 4, 5]
a         13/09/1995    5          [3, 4, 5, 5, 9]     [3, 4, 5, 5]
a         13/09/1995    9          [3, 4, 5, 5, 9]     [3, 4, 5, 5, 9]
a         13/01/1997    4          [3, 4, 5, 5, 9, 4]  [3, 4, 5, 5, 9, 4]
b         10/07/1992    4          [6, 4]              [6, 4]
b         10/07/1992    6          [6, 4]              [6]
b         13/07/1994    5          [6, 4, 5, 9]        [6, 4, 5]
b         13/07/1994    9          [6, 4, 5, 9]        [6, 4, 5, 9]
b         11/11/1998    9          [6, 4, 5, 9, 9]     [6, 4, 5, 9, 9]

If you look at the 5th line of: orderdate:13/09/1995, product:5 (Note: 13/09/1995 appears twice for custkey:a) you can see that the ROWS indeed took all rows from top till current line. But if you look at the RANGE, you see it includes also the value from the row after as it has the exact same orderdate so it is considered in same window.

like image 70
Chananel P Avatar answered Oct 22 '22 23:10

Chananel P