Remove all records that are duplicated in a Spark DataFrame

Question

I have a Spark DataFrame with multiple columns. I want to find and remove the rows that have duplicated values in one column (the other columns can differ).

I tried dropDuplicates(col_name), but it only drops the extra copies and still keeps one record per value in the DataFrame. What I need is to remove every row whose value occurred more than once in the original data.

I am using Spark 1.6 and Scala 2.10.

Raphael Roth · Accepted Answer

I would use window functions for this. Let's say you want to remove all rows whose id is duplicated:

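import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count
import sqlContext.implicits._  // enables $"..."; assumes a context named sqlContext, as in spark-shell

df
  // count how many rows share each id
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  // keep only ids that occur exactly once, then drop the helper column
  .where($"cnt" === 1)
  .drop("cnt")
  .show()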
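
As far as I know, window functions in Spark 1.x need a HiveContext rather than a plain SQLContext. If that is not available, the same result can be sketched with a groupBy and a join. This is an alternative to the window approach above, not part of it. The helper name dropAllDuplicated and its parameters are made up for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical helper: keeps only the rows whose value in keyCol
// occurs exactly once in df.
def dropAllDuplicated(df: DataFrame, keyCol: String): DataFrame = {
  // Keys that appear exactly once in the column.
  val uniqueKeys = df
    .groupBy(col(keyCol))
    .count()                      // adds a column named "count"
    .where(col("count") === 1)
    .select(col(keyCol))

  // Inner join on the key keeps only the rows with a unique key.
  // Joining on the column name avoids a duplicated key column.
  df.join(uniqueKeys, keyCol)
}
```

Usage would be dropAllDuplicated(df, "id").show(). One difference to keep in mind: an inner join never matches null keys, so rows with a null id are dropped here, while the window version keeps a null id if it occurs only once. The column order of the result may also differ from the input.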

Tags: duplicates, scala, apache-spark, apache-spark-sql, spark-dataframe

salmanbw
