PySpark / Spark Window Function First / Last Issue

From my understanding, the first/last functions in Spark should retrieve the first/last row of each partition. I am not able to understand why the last function is giving incorrect results.

This is my code.

from pyspark.sql.window import Window
from pyspark.sql.functions import first, last

AgeWindow = Window.partitionBy('Dept').orderBy('Age')
df1 = df1.withColumn('first(ID)', first('ID').over(AgeWindow))\
         .withColumn('last(ID)', last('ID').over(AgeWindow))
df1.show()
+---+----------+---+--------+---------+--------+
|Age|      Dept| ID|    Name|first(ID)|last(ID)|
+---+----------+---+--------+---------+--------+
| 38|  medicine|  4|   harry|        4|       4|
| 41|  medicine|  5|hermione|        4|       5|
| 55|  medicine|  7| gandalf|        4|       7|
| 15|technology|  6|  sirius|        6|       6|
| 49|technology|  9|     sam|        6|       9|
| 88|technology|  1|     sam|        6|       2|
| 88|technology|  2|     nik|        6|       2|
| 75|       mba|  8|   ginny|        8|      11|
| 75|       mba| 10|     sam|        8|      11|
| 75|       mba|  3|     ron|        8|      11|
| 75|       mba| 11|     ron|        8|      11|
+---+----------+---+--------+---------+--------+
asked Sep 11 '18 by Nikhil Redij

People also ask

What is F.first in PySpark?

pyspark.sql.functions.first(col, ignorenulls=False). Aggregate function: returns the first value in a group. By default the function returns the first value it sees; when ignorenulls is set to True, it returns the first non-null value.
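For illustration, a minimal sketch of first() as a grouping aggregate; the data and column names here are invented for the example:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("medicine", None), ("medicine", 4), ("mba", 8)],
    ("Dept", "ID")
)

# With ignorenulls=True nulls are skipped; with the default
# ignorenulls=False the first value seen may be null. Note that
# without an explicit ordering, which row counts as "first" in a
# groupBy is not deterministic.
df.groupBy("Dept").agg(
    F.first("ID", ignorenulls=True).alias("first_id")
).show()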

How do you use last in PySpark?

pyspark.sql.functions.last(col, ignorenulls=False). Aggregate function: returns the last value in a group. By default the function returns the last value it sees; when ignorenulls is set to True, it returns the last non-null value it sees.
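A common use of this (a sketch with invented column names, not taken from the question above) is a forward fill, where last() with ignorenulls=True over an ordered window carries the most recent non-null value forward:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10), (2, None), (3, None), (4, 40)],
    ("t", "value")
)

# Default frame: RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,
# so each row sees only itself and earlier rows.
w = Window.orderBy("t")

# Each row gets the most recent non-null value at or before it:
# 10, 10, 10, 40.
df.withColumn("filled", F.last("value", ignorenulls=True).over(w)).show()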

How does window work in PySpark?

A PySpark window function performs statistical operations such as rank or row number over a group, frame, or collection of rows and returns a result for each row individually. Window functions are also widely used for general data transformations.
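For example, a ranking window function keeps every input row and attaches a per-partition result to each (a sketch, with data invented for illustration):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("medicine", 38), ("medicine", 55), ("technology", 15)],
    ("Dept", "Age")
)

w = Window.partitionBy("Dept").orderBy("Age")

# Unlike groupBy().agg(), no rows are collapsed: every row keeps its
# columns and gains a rank computed within its Dept partition.
df.withColumn("rank_in_dept", F.row_number().over(w)).show()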

What is first() in Spark SQL?

The first() function returns the first element present in the column; when ignoreNulls is set to True, it returns the first non-null element. The last() function returns the last element present in the column; when ignoreNulls is set to True, it returns the last non-null element.


1 Answer

It is not incorrect. Your window definition is just not what you think it is.

If you provide an ORDER BY clause, then the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:

from pyspark.sql.window import Window
from pyspark.sql.functions import first, last

w = Window.partitionBy('Dept').orderBy('Age')

df = spark.createDataFrame(
    [(38, "medicine", 4), (41, "medicine", 5), (55, "medicine", 7)],
    ("Age", "Dept", "ID")
)

df.select(
    "*",
    first('ID').over(w).alias("first_id"), 
    last('ID').over(w).alias("last_id")
).explain()
== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS first_id#38L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), currentrow$())) AS last_id#40L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]

This means that the window function never looks ahead and the last row in the frame is the current row.
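To make that explicit, the implicit default frame can be written out by hand; this sketch (reusing the df, first, and last defined above) reproduces the behaviour observed in the question:

# Equivalent to plain partitionBy(...).orderBy(...): the frame ends
# at the current row, so last('ID') is always the current row's ID
# (modulo ties on Age, which RANGE frames include).
w_default = (Window
    .partitionBy('Dept')
    .orderBy('Age')
    .rangeBetween(Window.unboundedPreceding, Window.currentRow))

df.select(
    "*",
    first('ID').over(w_default).alias("first_id"),
    last('ID').over(w_default).alias("last_id")
).show()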

You should redefine the window as

w_uf = (Window
   .partitionBy('Dept')
   .orderBy('Age')
   .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

result = df.select(
    "*", 
    first('ID').over(w_uf).alias("first_id"),
    last('ID').over(w_uf).alias("last_id")
)
result.explain()
== Physical Plan ==
Window [first(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS first_id#56L, last(ID#24L, false) windowspecdefinition(Dept#23, Age#22L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS last_id#58L], [Dept#23], [Age#22L ASC NULLS FIRST]
+- *(1) Sort [Dept#23 ASC NULLS FIRST, Age#22L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(Dept#23, 200)
      +- Scan ExistingRDD[Age#22L,Dept#23,ID#24L]
result.show()
+---+--------+---+--------+-------+
|Age|    Dept| ID|first_id|last_id|
+---+--------+---+--------+-------+
| 38|medicine|  4|       4|      7|
| 41|medicine|  5|       4|      7|
| 55|medicine|  7|       4|      7|
+---+--------+---+--------+-------+
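As a side note (an alternative trick, not part of the answer above): since the default frame already contains everything up to the current row, the per-partition last value can also be obtained with first() over a window ordered descending, at the cost of tie-breaking subtleties on duplicate Age values:

from pyspark.sql.functions import desc

# "last by ascending Age" becomes "first by descending Age", so the
# default frame is sufficient and no unboundedFollowing is needed.
w_desc = Window.partitionBy('Dept').orderBy(desc('Age'))

df.select("*", first('ID').over(w_desc).alias("last_id")).show()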
answered Oct 05 '22 by zero323