I'm trying to use a logical expression based on a window function to detect duplicate records:
df
.where(count("*").over(Window.partitionBy($"col1",$"col2"))>lit(1))
.show
In Spark 2.1.1 this gives:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate
On the other hand, it works if I assign the result of the window function to a new column and then filter on that column:
df
.withColumn("count", count("*").over(Window.partitionBy($"col1", $"col2")))
.where($"count">lit(1)).drop($"count")
.show
I wonder how I can write this without using a temporary column?
I guess window functions cannot be used within a filter; you have to create an additional column and filter on that one.
What you could do is pull the window function into the select:
df.select(col("1"), col("2"), lag(col("2"), 1).over(window).alias("2_lag")).filter(col("2_lag") === col("2"))
Then you have it in one statement.
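Applied to the duplicate-detection case from the question, the same pattern might look like the sketch below (a self-contained example with made-up sample data; the column names `col1`/`col2` and the `cnt` alias follow the question, everything else is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val spark = SparkSession.builder.master("local[1]").appName("dups").getOrCreate()
import spark.implicits._

// Sample data: (1, "a") appears twice, so those two rows are duplicates.
val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("col1", "col2")

val w = Window.partitionBy($"col1", $"col2")

// Compute the window count inside the select, filter on the alias,
// and drop it again -- one statement, no separate withColumn step.
val dups = df
  .select(df.columns.map(col) :+ count("*").over(w).alias("cnt"): _*)
  .where($"cnt" > 1)
  .drop("cnt")

dups.show()
```

The alias still exists internally, but it never survives past the statement, which is effectively what the accepted workaround does in a single chained expression.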