Unable to filter DataFrame using Window function in Spark

I am trying to use a logical expression based on a window function to detect duplicate records:

df
  .where(count("*").over(Window.partitionBy($"col1", $"col2")) > lit(1))
  .show()

In Spark 2.1.1, this gives:

java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate

On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:

df
  .withColumn("count", count("*").over(Window.partitionBy($"col1", $"col2")))
  .where($"count" > lit(1))
  .drop($"count")
  .show()

I wonder how I can write this without using a temporary column?
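
For reference, the snippets above assume roughly the following setup (the sample data is a made-up reproduction):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// hypothetical sample data: ("a", 1) appears twice, so it counts as a duplicate
val df = Seq(("a", 1), ("a", 1), ("b", 2)).toDF("col1", "col2")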

asked Jun 15 '17 by Raphael Roth

1 Answer

I guess window functions cannot be used directly within a filter; as far as I know, Spark's analyzer only rewrites window expressions that appear in a projection or aggregation, which is why the filter variant above fails. You have to create an additional column and filter on that one.

What you can do is pull the window function into the select.

df.select(col("1"), col("2"), lag(col("2"), 1).over(window).alias("2_lag"))).filter(col("2_lag")==col("2"))

Then you have it in one statement.
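
Applied to the duplicate-detection case from the question, the same pattern would look roughly like this (a sketch; the "cnt" alias is my own, and it still has to be dropped if it should not appear in the output):

df.select($"col1", $"col2",
    count("*").over(Window.partitionBy($"col1", $"col2")).alias("cnt"))
  .filter($"cnt" > lit(1))
  .drop("cnt")
  .show()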

answered Oct 14 '22 by Andi Anderle