
Missing data when ordering Pyspark Window

This is my current dataset:

from pyspark.sql import Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df = spark.createDataFrame([("2","1",1),
                            ("3","1",2)],
                     schema = StructType([StructField("Data",  StringType()),
                                          StructField("Source",StringType()),
                                          StructField("Date",  IntegerType())]))


display(df.withColumn("Result",psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))

Output:

Data Source Date Result
2 1 1 ["2"]
3 1 2 ["2","3"]

Why am I missing the value 3 in the first row of column Result when using the collect_set function over a Window that is ordered?

I have tried to use collect_list as well, but I am getting the same results.

My desired output is:

Data Source Date Result
2 1 1 ["2","3"]
3 1 2 ["2","3"]

where the order of values in Result is preserved: the first one is where Date = 1 and the second one is where Date = 2.

Matus Hmelar asked Feb 25 '26 09:02

1 Answer

You need to use a Window with a frame spanning from Window.unboundedPreceding to Window.unboundedFollowing:

Window.partitionBy("Source").orderBy("Date") \
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

By default, Spark uses rowsBetween(Window.unboundedPreceding, Window.currentRow) when the window has an orderBy clause, so each row only "sees" the rows up to and including itself.

blackbishop answered Feb 27 '26 23:02