This question is primarily about older versions of PrestoSQL, which have been resolved in the (now renamed) Trino project as of versions 346. However, Amazon's Athena project is based off of Presto versions 0.217 (Athena Engine 2) and 0.172 (Athena Engine 1), which does have the issues described below. This question was written specifically around Athena Engine 1 / PrestoSQL version 0.172
ROWS BETWEEN
and RANGE BETWEEN
in Presto window Functions?
ROWS BETWEEN
allow more options than RANGE BETWEEN
?ROWS BETWEEN
and RANGE BETWEEN
and get different results?
unbounded
/current row
, is there a scenario where you'd use RANGE
instead of ROWS
(or vice-versa)?ROWS
has more options, why isn't it mentioned at all in the documentation? o_OThe presto documentation is fairly quiet about even RANGE
, and doesn't mention ROWS
. I haven't found many discussions or examples around window functions in Presto. I'm starting to set through the Presto code-base to try to figure this out. Hopefully someone can save me from that, and we can improve the documentation together.
The Presto code has a parser and test cases for the ROWS
variant, but there's no mention in the documentation of ROWS
.
The test cases I found with both ROWS
and RANGE
don't test anything different between the two syntaxes.
They almost look like synonyms, but they do behave differently in my testing, and have different allowed parameters and validation rules.
The following examples can be run with the starburstdata/presto Docker image running Presto 0.213-e-0.1. Typically I run Presto 0.172 through Amazon Athena, and have almost always ended up using ROWS
.
RANGE seems to be limited to "UNBOUNDED" and "CURRENT ROW". The following returns an error:
range between 1 preceding and 1 following
use tpch.tiny;
select custkey, orderdate,
array_agg(orderdate) over (
partition by custkey
order by orderdate asc
range between 1 preceding and 1 following
) previous_orders
from orders where custkey in (419, 320) and orderdate < date('1996-01-01')
order by custkey, orderdate asc;
ERROR:
Window frame RANGE PRECEDING is only supported with UNBOUNDED
The following range syntaxes do work fine (with expected differing results). All following examples based on the above query, just changing the range
range between unbounded preceding and current row
custkey | orderdate | previous_orders
---------+------------+--------------------------------------------------------------------------
320 | 1992-07-10 | [1992-07-10]
320 | 1992-07-30 | [1992-07-10, 1992-07-30]
320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08]
320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18]
320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
419 | 1992-03-16 | [1992-03-16]
419 | 1993-12-29 | [1992-03-16, 1993-12-29]
419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]
range between current row and unbounded following
custkey | orderdate | previous_orders
---------+------------+--------------------------------------------------------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1992-07-30 | [1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-07-08 | [1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-08-04 | [1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-09-18 | [1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1993-12-29 | [1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1995-01-30]
range between unbounded preceding and unbounded following
custkey | orderdate | previous_orders
---------+------------+--------------------------------------------------------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-07-08 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-08-04 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04, 1994-09-18, 1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1992-03-16, 1993-12-29, 1995-01-30]
The three working examples for RANGE
above all work for ROWS
and produce identical output.
rows between unbounded preceding and current row
rows between current row and unbounded following
rows between unbounded preceding and unbounded following
output omitted - identical to above
However, ROWS
allows for far more control, since you can also do the syntax above that fails with range
:
rows between 1 preceding and 1 following
custkey | orderdate | previous_orders
---------+------------+--------------------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30]
320 | 1992-07-30 | [1992-07-10, 1992-07-30, 1994-07-08]
320 | 1994-07-08 | [1992-07-30, 1994-07-08, 1994-08-04]
320 | 1994-08-04 | [1994-07-08, 1994-08-04, 1994-09-18]
320 | 1994-09-18 | [1994-08-04, 1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1994-09-18, 1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29]
419 | 1993-12-29 | [1992-03-16, 1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1993-12-29, 1995-01-30]
rows between current row and 1 following
custkey | orderdate | previous_orders
---------+------------+--------------------------
320 | 1992-07-10 | [1992-07-10, 1992-07-30]
320 | 1992-07-30 | [1992-07-30, 1994-07-08]
320 | 1994-07-08 | [1994-07-08, 1994-08-04]
320 | 1994-08-04 | [1994-08-04, 1994-09-18]
320 | 1994-09-18 | [1994-09-18, 1994-10-12]
320 | 1994-10-12 | [1994-10-12]
419 | 1992-03-16 | [1992-03-16, 1993-12-29]
419 | 1993-12-29 | [1993-12-29, 1995-01-30]
419 | 1995-01-30 | [1995-01-30]
rows between 5 preceding and 2 preceding
custkey | orderdate | previous_orders
---------+------------+--------------------------------------------------
320 | 1992-07-10 | NULL
320 | 1992-07-30 | NULL
320 | 1994-07-08 | [1992-07-10]
320 | 1994-08-04 | [1992-07-10, 1992-07-30]
320 | 1994-09-18 | [1992-07-10, 1992-07-30, 1994-07-08]
320 | 1994-10-12 | [1992-07-10, 1992-07-30, 1994-07-08, 1994-08-04]
419 | 1992-03-16 | NULL
419 | 1993-12-29 | NULL
419 | 1995-01-30 | [1992-03-16]
ROWS
are literally number of rows before and after that you want to aggregate. So ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
will end up with 3 rows: the curnet row 1 row before and 1 row after, regardless of the value of orderdate.RANGE
will look at the values of orderdate and will decide what should be aggregated and what not. So ORDER BY day RANGE BETWEEN 1 PRECEDING AND 1 FOLLOWING
would theoretically take all lines with values of orderdate-1, orderdate and orderdate+1 - this can be more than 3 lines (see more explanations here)In Presto the ROWS
is fully implemented, but the RANGE
is somehow only partially implemented, and you can only use in with CURRENT ROW
and UNBOUNDED
.
NOTE: Recent versions of Trino (formerly known as Presto SQL) have full support for
RANGE
andGROUPS
framing. See this blog post for an explanation of how they work.
The best way in Presto, to be able to see the diff between the two, is to make sure you have same values of the order clause:
WITH
tt1 (custkey, orderdate, product) AS
( SELECT * FROM ( VALUES ('a','1992-07-10', 3), ('a','1993-08-10', 4), ('a','1994-07-13', 5), ('a','1995-09-13', 5), ('a','1995-09-13', 9), ('a','1997-01-13', 4),
('b','1992-07-10', 6), ('b','1992-07-10', 4), ('b','1994-07-13', 5), ('b','1994-07-13', 9), ('b','1998-11-11', 9) ) )
SELECT *,
array_agg(product) OVER (partition by custkey) c,
array_agg(product) OVER (partition by custkey order by orderdate) c_order,
array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) range_ubub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rows_ubub,
array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) range_ubc,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) rows_ubc,
array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) range_cub,
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) rows_cub,
-- array_agg(product) OVER (partition by custkey order by orderdate RANGE BETWEEN 2 PRECEDING AND 2 FOLLOWING) range22,
-- SYNTAX_ERROR: line 19:65: Window frame RANGE PRECEDING is only supported with UNBOUNDED
array_agg(product) OVER (partition by custkey order by orderdate ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) rows22
from tt1
order by custkey, orderdate, product
You can run, and see full results, and learn from them..
I'll put here only some interesting columns:
custkey orderdate product range_ubc rows_ubc
a 10/07/1992 3 [3] [3]
a 10/08/1993 4 [3, 4] [3, 4]
a 13/07/1994 5 [3, 4, 5] [3, 4, 5]
a 13/09/1995 5 [3, 4, 5, 5, 9] [3, 4, 5, 5]
a 13/09/1995 9 [3, 4, 5, 5, 9] [3, 4, 5, 5, 9]
a 13/01/1997 4 [3, 4, 5, 5, 9, 4] [3, 4, 5, 5, 9, 4]
b 10/07/1992 4 [6, 4] [6, 4]
b 10/07/1992 6 [6, 4] [6]
b 13/07/1994 5 [6, 4, 5, 9] [6, 4, 5]
b 13/07/1994 9 [6, 4, 5, 9] [6, 4, 5, 9]
b 11/11/1998 9 [6, 4, 5, 9, 9] [6, 4, 5, 9, 9]
If you look at the 5th line of: orderdate:13/09/1995, product:5
(Note: 13/09/1995
appears twice for custkey:a
) you can see that the ROWS
indeed took all rows from top till current line. But if you look at the RANGE
, you see it includes also the value from the row after as it has the exact same orderdate
so it is considered in same window.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With