I'm trying to use a logical expression based on a window function to detect duplicate records:
df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show
In Spark 2.1.1 this gives:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate
On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:
df
.withColumn("count", count("*").over(Window.partitionBy($"col1", $"col2")))
.where($"count">lit(1)).drop($"count")
.show
I wonder how I can write this without using a temporary column?
I guess window functions cannot be used within a filter; you have to create an additional column and filter on that one.
What you could do is pull the window function into the select:
df.select(col("1"), col("2"), lag(col("2"), 1).over(window).alias("2_lag")).filter(col("2_lag") === col("2"))
Then you have it in one statement.
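Applied to the duplicate-detection case from the question, the same pattern might look like the sketch below (a self-contained example with made-up sample data; the column names `col1`/`col2` and the `cnt` alias follow the question, everything else is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val spark = SparkSession.builder.master("local[1]").appName("dups").getOrCreate()
import spark.implicits._

// Sample data: (1, "a") appears twice, so those two rows are duplicates.
val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("col1", "col2")

val w = Window.partitionBy($"col1", $"col2")

// Compute the window count inside the select, filter on the alias,
// and drop it again -- one statement, no separate withColumn step.
val dups = df
  .select(df.columns.map(col) :+ count("*").over(w).alias("cnt"): _*)
  .where($"cnt" > 1)
  .drop("cnt")

dups.show()
```

The alias still exists internally, but it never survives past the statement, which is effectively what the accepted workaround does in a single chained expression.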