I am working on a problem where I load data from a Hive table into a Spark DataFrame, and I now want all the unique accounts in one DataFrame and all the duplicates in another. For example, if I have acct IDs 1, 1, 2, 3, 4, I want to get 2, 3, 4 in one DataFrame and 1, 1 in another. How can I do this?
➠ Find complete row duplicates: GroupBy on all the columns (df.columns) along with the count() aggregate function, then filter for counts greater than 1.
➠ Find column-level duplicates: GroupBy on the required column(s) along with the count() aggregate function, then filter to get the duplicate records. A sketch of both is shown below.
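A minimal sketch of both approaches, assuming a DataFrame df loaded from the Hive table and an illustrative key column AcctId:
import org.apache.spark.sql.functions.{col, count}
// complete row duplicates: group by every column of the DataFrame
val rowDups = df.groupBy(df.columns.map(col): _*).agg(count("*").as("cnt")).filter($"cnt" > 1)
// column-level duplicates: group by the key column(s) only
val idDups = df.groupBy("AcctId").agg(count("*").as("cnt")).filter($"cnt" > 1)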
Use except() to subtract one DataFrame from another, i.e. to find the difference between two DataFrames. For example:
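One way to apply this here (a sketch, using the acctDF and AcctId names from the example further down): build the set of rows whose AcctId repeats, then subtract it from the full set to get the uniques.
// rows whose AcctId appears more than once
val dupIds = acctDF.groupBy("AcctId").count().filter($"count" > 1).select("AcctId")
val dupRows = acctDF.join(dupIds, Seq("AcctId"))
// everything else: the unique accounts
val uniqueRows = acctDF.except(dupRows)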
You can count the number of distinct rows on a set of columns and compare it with the total number of rows. If they are the same, there are no duplicate rows. If the number of distinct rows is less than the total number of rows, duplicates exist.
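A quick check along those lines (a sketch; it uses the acctDF defined just below):
val totalRows = acctDF.count()
val distinctIds = acctDF.select("AcctId").distinct().count()
// if distinct < total, at least one AcctId is duplicated
if (distinctIds < totalRows) println(s"duplicates exist: $totalRows rows vs $distinctIds distinct AcctIds")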
val acctDF = List(("1", "Acc1"), ("1", "Acc1"), ("1", "Acc1"), ("2", "Acc2"), ("2", "Acc2"), ("3", "Acc3")).toDF("AcctId", "Details")
scala> acctDF.show()
+------+-------+
|AcctId|Details|
+------+-------+
| 1| Acc1|
| 1| Acc1|
| 1| Acc1|
| 2| Acc2|
| 2| Acc2|
| 3| Acc3|
+------+-------+
// Convert the DF to an RDD to apply map and reduceByKey, then convert back to a DF for further use
val countsDF = acctDF.rdd.map(rec => (rec(0), 1)).reduceByKey(_+_).map(rec=> (rec._1.toString, rec._2)).toDF("AcctId", "AcctCount")
val accJoinedDF = acctDF.join(countsDF, acctDF("AcctId")===countsDF("AcctId"), "left_outer").select(acctDF("AcctId"), acctDF("Details"), countsDF("AcctCount"))
scala> accJoinedDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
| 3| Acc3| 1|
+------+-------+---------+
val distAcctDF = accJoinedDF.filter($"AcctCount"===1)
scala> distAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 3| Acc3| 1|
+------+-------+---------+
val duplAcctDF = accJoinedDF.filter($"AcctCount">1)
scala> duplAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
+------+-------+---------+
(Or, to list each duplicated account only once: scala> duplAcctDF.distinct.show() )
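As a side note, the RDD round-trip is not strictly required; the same per-account counts can be produced with the DataFrame API directly (a sketch reusing the same column names):
import org.apache.spark.sql.functions.count
val countsDF2 = acctDF.groupBy("AcctId").agg(count("*").as("AcctCount"))
val accJoinedDF2 = acctDF.join(countsDF2, Seq("AcctId"))
val distAcctDF2 = accJoinedDF2.filter($"AcctCount" === 1)
val duplAcctDF2 = accJoinedDF2.filter($"AcctCount" > 1)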
Depending on the version of Spark you have, you could use window functions in Datasets/SQL like below:
// imports needed: import static org.apache.spark.sql.functions.*; import org.apache.spark.sql.expressions.Window;
Dataset<Row> withDupCount = df.withColumn("Duplicate", count("*").over(Window.partitionBy("id")));
Dataset<Row> dups = withDupCount.filter(col("Duplicate").gt(1));
Dataset<Row> uniques = withDupCount.filter(col("Duplicate").equalTo(1));
The above is written in Java; it should be similar in Scala (a sketch follows below). For how to do it in Python, read this: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
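A rough Scala equivalent of the snippet above (the column name id is assumed to match):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}
val withDupCount = df.withColumn("Duplicate", count("*").over(Window.partitionBy("id")))
val dups = withDupCount.filter(col("Duplicate") > 1)
val uniques = withDupCount.filter(col("Duplicate") === 1)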