From the PySpark docs rangeBetween
:
rangeBetween(start, end)
Defines the frame boundaries, from start (inclusive) to end (inclusive).
Both start and end are relative from the current row. For example, “0” means “current row”, while “-1” means one off before the current row, and “5” means the five off after the current row.
Parameters:
- start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
- end – boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher). New in version 1.4.
while rowsBetween
rowsBetween(start, end)
Defines the frame boundaries, from start (inclusive) to end (inclusive).
Both start and end are relative positions from the current row. For example, “0” means “current row”, while “-1” means the row before the current row, and “5” means the fifth row after the current row.
Parameters:
- start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
- end – boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher). New in version 1.4.
For rangeBetween
how is "1 off" different from "1 row", for example?
A row based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row, the frame for the current row starts or ends. For instance, given a row based sliding frame with a lower bound offset of -1 and a upper bound offset of +2.
To calculate a difference, you need a pair of records; those two records are “the current record” and “the previous year's record”. You obtain this record using the LAG() window function. This function allows you to obtain data from the previous record (based on an order criterion, which here is “ ORDER BY year ”).
UNBOUNDED PRECEDING is the default. CURRENT ROW indicates the window begins or ends at the current row. UNBOUNDED FOLLOWING indicates that the window ends at the last row of the partition; offset FOLLOWING indicates that the window ends a number of rows equivalent to the value of offset after the current row.
The frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, means that the window consists of the first row of the partition and all the rows up to the current row. Each calculation is done over a different set of rows. For example, when performing the calculation for row 4, the rows 1 to 4 are used.
It is simple:
ROWS BETWEEN
doesn't care about the exact values. It cares only about the order of rows, and takes fixed number of preceding and following rows when computing frame.RANGE BETWEEN
considers values when computing frame.Let's use an example using two window definitions:
ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
ORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROW
and data as
+---+ | x| +---+ | 10| | 20| | 30| | 31| +---+
Assuming the current row is the one with value 31 for the first window following rows will be included (current one, and two preceding):
+---+----------------------------------------------------+ | x|ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROW| +---+----------------------------------------------------+ | 10| false| | 20| true| | 30| true| | 31| true| +---+----------------------------------------------------+
and for the second one following (current one, and all preceding where x >= 31 - 2):
+---+-----------------------------------------------------+ | x|ORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROW| +---+-----------------------------------------------------+ | 10| false| | 20| false| | 30| true| | 31| true| +---+-----------------------------------------------------+
The Java spark docs add clarity: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html#rowsBetween-long-long-
rangeBetween
A range-based boundary is based on the actual value of the ORDER BY expression(s). An offset is used to alter the value of the ORDER BY expression, for instance if the current order by expression has a value of 10 and the lower bound offset is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts a number of constraints on the ORDER BY expressions: there can be only one expression and this expression must have a numerical data type. An exception can be made when the offset is unbounded, because no value modification is needed, in this case multiple and non-numeric ORDER BY expression are allowed.
rowBetween
A row based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row, the frame for the current row starts or ends. For instance, given a row based sliding frame with a lower bound offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from index 4 to index 6.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With