
Missing data when ordering Pyspark Window

This is my current dataset:

from pyspark.sql import Window
import pyspark.sql.functions as psf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

df = spark.createDataFrame([("2","1",1),
                            ("3","1",2)],
                     schema = StructType([StructField("Data",  StringType()),
                                          StructField("Source",StringType()),
                                          StructField("Date",  IntegerType())]))


display(df.withColumn("Result",psf.collect_set("Data").over(Window.partitionBy("Source").orderBy("Date"))))

Output:

Data Source Date Result
2 1 1 ["2"]
3 1 2 ["2","3"]

Why am I missing the value 3 in the first row of column Result when using the collect_set function over a Window that is ordered?

I have tried to use collect_list as well, but I am getting the same results.

My desired output is:

Data Source Date Result
2 1 1 ["2","3"]
3 1 2 ["2","3"]

where the order of values in Result is preserved: the first one is where Date = 1 and the second one is where Date = 2.

Matus Hmelar asked Feb 25 '26 09:02

1 Answer

You need to use a Window with a frame spanning from Window.unboundedPreceding to Window.unboundedFollowing:

Window.partitionBy("Source").orderBy("Date") \
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

By default, Spark uses rowsBetween(Window.unboundedPreceding, Window.currentRow) when the window has an orderBy clause, so each row only "sees" the rows up to and including itself.

blackbishop answered Feb 27 '26 23:02