How orderBy affects Window.partitionBy in Pyspark dataframe?

Tags:

sql-order-by

window

pyspark

I explain my question through an example:
Let us assume we have a dataframe as follows:

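from pyspark.sql import Window, functions as F  # imports used by the snippets below
# `sc` is assumed to be a SparkSession (or SQLContext); a plain SparkContext has no createDataFrame
original_df = sc.createDataFrame([('x', 10,), ('x', 15,), ('x', 10,), ('x', 25,), ('y', 20,), ('y', 10,), ('y', 20,)], ["key", "price"] )
original_df.show()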

Output:

+---+-----+
|key|price|
+---+-----+
|  x|   10|
|  x|   15|
|  x|   10|
|  x|   25|
|  y|   20|
|  y|   10|
|  y|   20|
+---+-----+

And assume I want to get a list of prices for each key using a window:

w = Window.partitionBy('key')
original_df.withColumn('price_list', F.collect_list('price').over(w)).show()

Output:

+---+-----+----------------+
|key|price|      price_list|
+---+-----+----------------+
|  x|   10|[10, 15, 10, 25]|
|  x|   15|[10, 15, 10, 25]|
|  x|   10|[10, 15, 10, 25]|
|  x|   25|[10, 15, 10, 25]|
|  y|   20|    [20, 10, 20]|
|  y|   10|    [20, 10, 20]|
|  y|   20|    [20, 10, 20]|
+---+-----+----------------+

So far so good.
But if I want to get an ordered list, and I add orderBy to my window w I get:

w = Window.partitionBy('key').orderBy('price')
original_df.withColumn('ordered_list', F.collect_list('price').over(w)).show()

Output:

+---+-----+----------------+
|key|price|    ordered_list|
+---+-----+----------------+
|  x|   10|        [10, 10]|
|  x|   10|        [10, 10]|
|  x|   15|    [10, 10, 15]|
|  x|   25|[10, 10, 15, 25]|
|  y|   10|            [10]|
|  y|   20|    [10, 20, 20]|
|  y|   20|    [10, 20, 20]|
+---+-----+----------------+

This means orderBy (kind of) changed which rows are in the window as well (the same way rowsBetween does), which it is not supposed to do.

Even though I can fix it by specifying rowsBetween in the window and get the expected results,

w = Window.partitionBy('key').orderBy('price').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

can someone explain why orderBy affects window in that way?


asked Dec 06 '18 08:12

Ala Tarighati

1 Answer

A Spark window is specified using three parts: partition, order, and frame.

1. When none of the parts is specified, the whole dataset is treated as a single window.
2. When a partition is specified using a column, one window is created per distinct value of that column. If only the partition is specified, then when a window function is evaluated for a row, all the rows in that partition are taken into account. That's why you see all 4 values [10, 15, 10, 25] for every row in partition x.
3. When both partition and ordering are specified but no frame is, Spark applies a default frame: a range frame from unboundedPreceding to currentRow. When the function is evaluated for a row, it includes every row in the partition whose ordering value is the same as or lower than the current row's (for the default ascending order), so rows with equal values are always included together. In your case, the first row shows [10, 10] because there are 2 rows in the partition with the same price.
4. When a frame is specified with rowsBetween or rangeBetween, the evaluation picks only the rows that match the frame rule. For example, with rowsBetween(Window.unboundedPreceding, Window.currentRow) it picks the current row and all rows that occur before it. If orderBy is specified, it changes which rows occur before the current row accordingly.

Specifically to your question: orderBy does not only sort the partitioned data, it also changes which rows fall into the frame for each row.
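To make the difference concrete without a Spark session, here is a minimal pure-Python sketch of the two frame types. It is only a model under stated assumptions, not Spark's implementation: the helper names (range_frame, rows_frame, collect_over) are made up for illustration, and it assumes ascending order on a single numeric column. The range frame is what Spark documents as the default once an ordering is defined (unboundedPreceding to currentRow); the rows frame is what you get when you ask for rowsBetween(Window.unboundedPreceding, Window.currentRow) explicitly.

```python
# Pure-Python model of window frames (no Spark needed).
# All function names here are hypothetical helpers, not part of any Spark API.

def range_frame(sorted_values, i):
    """Model of RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    every row whose ordering value is <= the current row's value,
    so rows that tie with the current row are included too."""
    current = sorted_values[i]
    return [v for v in sorted_values if v <= current]


def rows_frame(sorted_values, i):
    """Model of ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:
    rows up to and including the current physical position."""
    return sorted_values[: i + 1]


def collect_over(values, frame):
    """Model of collect_list(...).over(window) for one partition:
    sort ascending (orderBy), then collect the frame for each row."""
    ordered = sorted(values)
    return [frame(ordered, i) for i in range(len(ordered))]


partition_x = [10, 15, 10, 25]
partition_y = [20, 10, 20]

# Range frame: the two rows with price 10 both see [10, 10].
default_with_order_x = collect_over(partition_x, range_frame)
default_with_order_y = collect_over(partition_y, range_frame)

# Rows frame: the list grows by exactly one element per row.
explicit_rows_x = collect_over(partition_x, rows_frame)
explicit_rows_y = collect_over(partition_y, rows_frame)

print(default_with_order_x)
print(explicit_rows_x)
```

The range-frame lists correspond to the Window.partitionBy('key').orderBy('price') output, and the rows-frame lists to the output with the explicit rowsBetween(Window.unboundedPreceding, Window.currentRow) frame.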

Below are different window specifications and the corresponding output:

Window.orderBy()
+---+-----+----------------------------+
|key|price|price_list                  |
+---+-----+----------------------------+
|x  |15   |[15, 10, 10, 20, 10, 25, 20]|
|x  |10   |[15, 10, 10, 20, 10, 25, 20]|
|y  |10   |[15, 10, 10, 20, 10, 25, 20]|
|y  |20   |[15, 10, 10, 20, 10, 25, 20]|
|x  |10   |[15, 10, 10, 20, 10, 25, 20]|
|x  |25   |[15, 10, 10, 20, 10, 25, 20]|
|y  |20   |[15, 10, 10, 20, 10, 25, 20]|
+---+-----+----------------------------+

Window.partitionBy('key')
+---+-----+----------------+
|key|price|      price_list|
+---+-----+----------------+
|  x|   15|[15, 10, 10, 25]|
|  x|   10|[15, 10, 10, 25]|
|  x|   10|[15, 10, 10, 25]|
|  x|   25|[15, 10, 10, 25]|
|  y|   20|    [20, 10, 20]|
|  y|   10|    [20, 10, 20]|
|  y|   20|    [20, 10, 20]|
+---+-----+----------------+

Window.partitionBy('key').orderBy('price')
+---+-----+----------------+
|key|price|    ordered_list|
+---+-----+----------------+
|  x|   10|        [10, 10]|
|  x|   10|        [10, 10]|
|  x|   15|    [10, 10, 15]|
|  x|   25|[10, 10, 15, 25]|
|  y|   10|            [10]|
|  y|   20|    [10, 20, 20]|
|  y|   20|    [10, 20, 20]|
+---+-----+----------------+

w = Window.partitionBy('key').orderBy(F.desc('price'))
+---+-----+----------------+
|key|price|    ordered_list|
+---+-----+----------------+
|  x|   25|            [25]|
|  x|   15|        [25, 15]|
|  x|   10|[25, 15, 10, 10]|
|  x|   10|[25, 15, 10, 10]|
|  y|   20|        [20, 20]|
|  y|   20|        [20, 20]|
|  y|   10|    [20, 20, 10]|
+---+-----+----------------+

Window.partitionBy('key').orderBy('price').rowsBetween(Window.unboundedPreceding, Window.currentRow)
+---+-----+----------------+
|key|price|    ordered_list|
+---+-----+----------------+
|  x|   10|            [10]|
|  x|   10|        [10, 10]|
|  x|   15|    [10, 10, 15]|
|  x|   25|[10, 10, 15, 25]|
|  y|   10|            [10]|
|  y|   20|        [10, 20]|
|  y|   20|    [10, 20, 20]|
+---+-----+----------------+

Window.partitionBy('key').rowsBetween(Window.unboundedPreceding, Window.currentRow)
+---+-----+----------------+
|key|price|    ordered_list|
+---+-----+----------------+
|  x|   15|            [15]|
|  x|   10|        [15, 10]|
|  x|   10|    [15, 10, 10]|
|  x|   25|[15, 10, 10, 25]|
|  y|   10|            [10]|
|  y|   20|        [10, 20]|
|  y|   20|    [10, 20, 20]|
+---+-----+----------------+


answered Oct 29 '22 15:10

Manoj Singh
