Spark remove duplicate rows from DataFrame [duplicate]

Assume I have a DataFrame like this:

val json = sc.parallelize(Seq(
  """{"a":1, "b":2, "c":22, "d":34}""",
  """{"a":3, "b":9, "c":22, "d":12}""",
  """{"a":1, "b":4, "c":23, "d":12}"""
))
val df = sqlContext.read.json(json)

I want to remove rows that are duplicates with respect to column "a", choosing which row to keep by the value of column "b": if several rows share the same value of "a", I want to keep the one with the largest value of "b". For the example above, after processing I need only

{"a":3, "b":9, "c":22, "d":12}

and

{"a":1, "b":4, "c":23, "d":12}

The Spark DataFrame dropDuplicates API doesn't seem to support this. With the RDD approach I can do a map().reduceByKey(), but is there a DataFrame-specific operation for this?
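For reference, the RDD approach mentioned above might look roughly like the sketch below. It assumes the `df` and `sqlContext` defined earlier, and that the JSON integers were inferred as Long; the names `dedupedRdd` and `dedupedDf` are only illustrative.

```scala
// Key every row by column "a", then keep the row with the larger "b" per key.
// Assumes "a" and "b" were inferred as LongType when reading the JSON.
val dedupedRdd = df.rdd
  .map(row => (row.getAs[Long]("a"), row))
  .reduceByKey((r1, r2) => if (r1.getAs[Long]("b") >= r2.getAs[Long]("b")) r1 else r2)
  .values

// Rebuild a DataFrame from the surviving rows, reusing the original schema.
val dedupedDf = sqlContext.createDataFrame(dedupedRdd, df.schema)
```

This works, but it leaves the DataFrame API and goes through Row objects, which is what I would like to avoid.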

Appreciate some help, thanks.

268

asked Feb 19 '16 05:02

void

1 Answer

You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")
// Number the rows within each "a" group by descending "b"; keep only the first.
sqlContext.sql("""
  SELECT a, b, c, d
  FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) AS rn FROM x) y
  WHERE rn = 1
""").collect

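This keeps, for each value of "a", the single row with the largest "b". Note that on Spark 1.x, window functions require a HiveContext rather than a plain SQLContext. Read more about window function support: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html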
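The same idea can also be expressed with the DataFrame API instead of a SQL string. The following is a minimal sketch, assuming Spark 1.6 or later (where the function is named `row_number`) and the `df` from the question; `byA`, `rn`, and `deduped` are names chosen here for illustration.

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// One window per value of "a", ordered so the largest "b" comes first.
val byA = Window.partitionBy("a").orderBy(col("b").desc)

val deduped = df
  .withColumn("rn", row_number().over(byA)) // rank rows inside each "a" group
  .where(col("rn") === 1)                   // keep only the top-ranked row
  .drop("rn")                               // remove the helper column
```

If two rows tie on both "a" and "b", `row_number` still keeps exactly one of them, but which one is not determined by this ordering alone.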

162

answered Nov 10 '22 18:11

Pankaj Arora

Tags:

dataframe

scala

apache-spark

apache-spark-sql
