Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the difference between rowsBetween and rangeBetween?

From the PySpark docs rangeBetween:

rangeBetween(start, end)

Defines the frame boundaries, from start (inclusive) to end (inclusive).

Both start and end are relative from the current row. For example, “0” means “current row”, while “-1” means one off before the current row, and “5” means the five off after the current row.

Parameters:

  • start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
  • end – boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher). New in version 1.4.

while rowsBetween

rowsBetween(start, end)

Defines the frame boundaries, from start (inclusive) to end (inclusive).

Both start and end are relative positions from the current row. For example, “0” means “current row”, while “-1” means the row before the current row, and “5” means the fifth row after the current row.

Parameters:

  • start – boundary start, inclusive. The frame is unbounded if this is -sys.maxsize (or lower).
  • end – boundary end, inclusive. The frame is unbounded if this is sys.maxsize (or higher). New in version 1.4.

For rangeBetween how is "1 off" different from "1 row", for example?

like image 976
Evan Zamir Avatar asked Oct 14 '16 17:10

Evan Zamir


People also ask

What is Rowsbetween Pyspark?

A row based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row, the frame for the current row starts or ends. For instance, given a row based sliding frame with a lower bound offset of -1 and a upper bound offset of +2.

How do you find the difference between two rows in SQL?

To calculate a difference, you need a pair of records; those two records are “the current record” and “the previous year's record”. You obtain this record using the LAG() window function. This function allows you to obtain data from the previous record (based on an order criterion, which here is “ ORDER BY year ”).

What is unbounded preceding and unbounded following?

UNBOUNDED PRECEDING is the default. CURRENT ROW indicates the window begins or ends at the current row. UNBOUNDED FOLLOWING indicates that the window ends at the last row of the partition; offset FOLLOWING indicates that the window ends a number of rows equivalent to the value of offset after the current row.

What is RANGE between unbounded preceding and current row?

The frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, means that the window consists of the first row of the partition and all the rows up to the current row. Each calculation is done over a different set of rows. For example, when performing the calculation for row 4, the rows 1 to 4 are used.


2 Answers

It is simple:

  • ROWS BETWEEN doesn't care about the exact values. It cares only about the order of rows, and takes fixed number of preceding and following rows when computing frame.
  • RANGE BETWEEN considers values when computing frame.

Let's use an example using two window definitions:

  • ORDER BY x ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  • ORDER BY x RANGE BETWEEN 2 PRECEDING AND CURRENT ROW

and data as

+---+ |  x| +---+ | 10| | 20| | 30| | 31| +---+ 

Assuming the current row is the one with value 31 for the first window following rows will be included (current one, and two preceding):

+---+----------------------------------------------------+ |  x|ORDER BY x ROWS BETWEEN 2  PRECEDING AND CURRENT ROW| +---+----------------------------------------------------+ | 10|                                               false| | 20|                                                true| | 30|                                                true| | 31|                                                true| +---+----------------------------------------------------+ 

and for the second one following (current one, and all preceding where x >= 31 - 2):

+---+-----------------------------------------------------+ |  x|ORDER BY x RANGE BETWEEN 2  PRECEDING AND CURRENT ROW| +---+-----------------------------------------------------+ | 10|                                                false| | 20|                                                false| | 30|                                                 true| | 31|                                                 true| +---+-----------------------------------------------------+ 
like image 140
zero323 Avatar answered Sep 28 '22 22:09

zero323


The Java spark docs add clarity: https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/expressions/WindowSpec.html#rowsBetween-long-long-

rangeBetween

A range-based boundary is based on the actual value of the ORDER BY expression(s). An offset is used to alter the value of the ORDER BY expression, for instance if the current order by expression has a value of 10 and the lower bound offset is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts a number of constraints on the ORDER BY expressions: there can be only one expression and this expression must have a numerical data type. An exception can be made when the offset is unbounded, because no value modification is needed, in this case multiple and non-numeric ORDER BY expression are allowed.

rowBetween

A row based boundary is based on the position of the row within the partition. An offset indicates the number of rows above or below the current row, the frame for the current row starts or ends. For instance, given a row based sliding frame with a lower bound offset of -1 and a upper bound offset of +2. The frame for row with index 5 would range from index 4 to index 6.

like image 21
Peter Avatar answered Sep 28 '22 20:09

Peter