I was looking at the DataFrame API, and I can see two different methods providing the same functionality for removing duplicates from a data set.
I can understand that dropDuplicates(colNames) will remove duplicates considering only a subset of columns.
Are there any other differences between these two methods?
Spark's distinct() doesn't take columns to run the distinct on; however, Spark provides another signature of the dropDuplicates() function which takes multiple columns to eliminate duplicates. Note that calling dropDuplicates() on a DataFrame returns a new DataFrame with the duplicate rows removed.
- Find complete row duplicates: groupBy can be used over all the columns, along with the count() aggregate function, and then filter keeps only the groups whose count is greater than one.
- Find column-level duplicates: groupBy with just the required columns can be used along with the count() aggregate function, and then filter can be used to get the duplicate records.
The main difference is the handling of a subset of columns. When using distinct you need a prior .select to pick the columns on which you want to deduplicate, and the returned DataFrame contains only those selected columns, whereas dropDuplicates(colNames) will return all the columns of the initial DataFrame after removing rows that are duplicated over the given columns.
Let's assume we have the following Spark DataFrame:

```
+---+------+---+
| id|  name|age|
+---+------+---+
|  1|Andrew| 25|
|  1|Andrew| 25|
|  1|Andrew| 26|
|  2| Maria| 30|
+---+------+---+
```
`distinct()` does not accept any arguments, which means that you cannot select which columns should be taken into account when dropping the duplicates. This means that the following command will drop the duplicate records taking into account all the columns of the DataFrame:
```
>>> df.distinct().show()
+---+------+---+
| id|  name|age|
+---+------+---+
|  1|Andrew| 26|
|  2| Maria| 30|
|  1|Andrew| 25|
+---+------+---+
```
Now in case you want to drop the duplicates considering ONLY `id` and `name`, you'd have to run a `select()` prior to `distinct()`. For example,
```
>>> df.select(['id', 'name']).distinct().show()
+---+------+
| id|  name|
+---+------+
|  2| Maria|
|  1|Andrew|
+---+------+
```
But in case you wanted to drop the duplicates only over a subset of columns like above but keep ALL the columns, then `distinct()` is not your friend.
`dropDuplicates()` will drop the duplicates detected over the provided set of columns, but it will also return all the columns appearing in the original DataFrame.
```
>>> df.dropDuplicates().show()
+---+------+---+
| id|  name|age|
+---+------+---+
|  1|Andrew| 26|
|  2| Maria| 30|
|  1|Andrew| 25|
+---+------+---+
```
`dropDuplicates()` is thus more suitable when you want to drop duplicates over a selected subset of columns, but also want to keep all the columns:

```
>>> df.dropDuplicates(['id', 'name']).show()
+---+------+---+
| id|  name|age|
+---+------+---+
|  2| Maria| 30|
|  1|Andrew| 25|
+---+------+---+
```
For more details refer to the article distinct() vs dropDuplicates() in Python
From the javadoc, there is no difference between distinct() and dropDuplicates().
`dropDuplicates`

```
public DataFrame dropDuplicates()
```

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.
dropDuplicates() was introduced in Spark 1.4 as a replacement for distinct(), as you can use its overloaded methods to get unique rows based on a subset of columns.