What row is used in dropDuplicates operator?

1 Answer

TL;DR Keep First (according to row order)

The dropDuplicates operator in Spark SQL creates a logical plan with a Deduplicate operator.

Spark SQL's Catalyst optimizer then translates that Deduplicate operator into an Aggregate logical operator that applies the First aggregate function to the non-key columns, which answers your question nicely (!)
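To make "keep first" concrete, here is a minimal sketch of the same idea on plain Scala collections rather than Spark. The Person case class and the sample rows are hypothetical and exist only for this illustration. It models the semantics, not Spark's implementation.

```scala
// Hypothetical row type, used only for this illustration.
case class Person(id: Int, name: String)

val rows = Seq(Person(1, "a"), Person(1, "b"), Person(2, "c"), Person(2, "d"))

// Group by the deduplication key and keep the first row seen for each key.
// For a Seq, groupBy keeps the elements of every group in encounter order.
val kept = rows.groupBy(_.id).values.map(_.head).toSeq.sortBy(_.id)

println(kept)
```

In a local collection the encounter order is fixed. In Spark, the first function returns the first value it sees, and that order may not be deterministic after a shuffle, so treat "first" as "first row that reached the aggregation".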

You can see the Deduplicate operator in the logical plan below.

// create a dataset with duplicates
val dups = spark.range(9).map(_ % 3)

val q = dups.dropDuplicates

The following is the logical plan of the q dataset.

scala> println(q.queryExecution.logical.numberedTreeString)
00 Deduplicate [value#64L], false
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cast(id#58L as bigint), true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))

The optimizer then replaces the Deduplicate operator with an Aggregate operator. Non-key columns are wrapped in the First aggregate function. Here every column is a deduplication key, so only the grouping column shows up.

scala> println(q.queryExecution.optimizedPlan.numberedTreeString)
00 Aggregate [value#64L], [value#64L]
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#58L, true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))
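In the plan above every column is a deduplication key, so no first call is visible. The following sketch, which assumes a local SparkSession and a hypothetical two-column dataset (id, name), deduplicates on a subset of the columns so that the optimized plan should show first applied to the remaining column. In spark-shell the spark value already exists and the session setup lines can be skipped.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dedup-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data: id 1 appears twice with different names.
val people = Seq((1, "a"), (1, "b"), (2, "c")).toDF("id", "name")

// Deduplicate on id only; name is a non-key column.
val deduped = people.dropDuplicates("id")

// The Aggregate operator is expected to group by id and
// apply first to the non-key column name.
println(deduped.queryExecution.optimizedPlan.numberedTreeString)
```

Which name survives for id 1 depends on the order in which the rows reach the aggregation, so do not rely on it if a specific row matters.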

After spending some time reviewing the code of Apache Spark, I found that the dropDuplicates operator is equivalent to groupBy followed by the first function.

first(columnName: String, ignoreNulls: Boolean): Column

Aggregate function: returns the first value of a column in a group.

import org.apache.spark.sql.functions.first
val firsts = dups.groupBy("value").agg(first("value") as "value")

scala> println(firsts.queryExecution.logical.numberedTreeString)
00 'Aggregate [value#64L], [value#64L, first('value, false) AS value#139]
01 +- SerializeFromObject [input[0, bigint, false] AS value#64L]
02    +- MapElements <function1>, class java.lang.Long, [StructField(value,LongType,true)], obj#63: bigint
03       +- DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, cast(id#58L as bigint), true), obj#62: java.lang.Long
04          +- Range (0, 9, step=1, splits=Some(8))

scala> firsts.explain
== Physical Plan ==
*HashAggregate(keys=[value#64L], functions=[first(value#64L, false)])
+- Exchange hashpartitioning(value#64L, 200)
   +- *HashAggregate(keys=[value#64L], functions=[partial_first(value#64L, false)])
      +- *SerializeFromObject [input[0, bigint, false] AS value#64L]
         +- *MapElements <function1>, obj#63: bigint
            +- *DeserializeToObject staticinvoke(class java.lang.Long, ObjectType(class java.lang.Long), valueOf, id#58L, true), obj#62: java.lang.Long
               +- *Range (0, 9, step=1, splits=8)

I also think that the dropDuplicates operator may be more performant than the explicit groupBy followed by first.
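One way to check this for your own data is to run both variants side by side and compare their physical plans and results. The sketch below assumes a local SparkSession. The names viaDropDuplicates, viaGroupByFirst and first_value are made up for this example. It makes no claim about which variant is faster, it only prints the plans so you can compare them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

// Assumption: a local session; in spark-shell `spark` is already defined.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("dedup-compare")
  .getOrCreate()
import spark.implicits._

// Same dataset as in the answer: the values 0, 1, 2 repeated three times.
val dups = spark.range(9).map(_ % 3)

val viaDropDuplicates = dups.dropDuplicates()
val viaGroupByFirst = dups.groupBy("value").agg(first("value") as "first_value")

// Print both physical plans for a side-by-side comparison.
viaDropDuplicates.explain()
viaGroupByFirst.explain()

// Both variants should yield the same set of distinct values.
val fromDrop = viaDropDuplicates.collect().toSet
val fromFirst = viaGroupByFirst.select("first_value").as[Long].collect().toSet
println(fromDrop == fromFirst)
```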


answered Sep 21 '22 02:09

Jacek Laskowski
