Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Spark SQL DataFrame - distinct() vs dropDuplicates()

I was looking at the DataFrame API, i can see two different methods doing the same functionality for removing duplicates from a data set.

I can understand dropDuplicates(colNames) will remove duplicates considering only the subset of columns.

Is there any other differences between these two methods?

like image 775
Shankar Avatar asked Feb 27 '16 07:02

Shankar


People also ask

How do I select distinct in spark?

Spark doesn't have a distinct method that takes columns that should run distinct on however, Spark provides another signature of dropDuplicates() function which takes multiple columns to eliminate duplicates. Note that calling dropDuplicates() on DataFrame returns a new DataFrame with duplicate rows removed.

How can I find duplicates in spark?

➠ Find complete row duplicates: GroupBy can be used along with count() aggregate function on all the columns (using df. ➠ Find column level duplicates: GroupBy with required columns can be used along with count() aggregate function and then filter can be used to get duplicate records.

What is the difference between select and selectExpr in spark?

Therefore, select() method is useful when you simply need to select a subset of columns from a particular Spark DataFrame. On the other hand, selectExpr() comes in handy when you need to select particular columns while at the same time you also need to apply some sort of transformation over particular column(s).


3 Answers

The main difference is the consideration of the subset of columns which is great! When using distinct you need a prior .select to select the columns on which you want to apply the duplication and the returned Dataframe contains only these selected columns while dropDuplicates(colNames) will return all the columns of the initial dataframe after removing duplicated rows as per the columns.

like image 109
Bentech Avatar answered Sep 20 '22 10:09

Bentech


Let's assume we have the following spark dataframe

+---+------+---+                                                                 | id|  name|age| +---+------+---+ |  1|Andrew| 25| |  1|Andrew| 25| |  1|Andrew| 26| |  2| Maria| 30| +---+------+---+ 

distinct() does not accept any arguments which means that you cannot select which columns need to be taken into account when dropping the duplicates. This means that the following command will drop the duplicate records taking into account all the columns of the dataframe:

df.distinct().show()  +---+------+---+ | id|  name|age| +---+------+---+ |  1|Andrew| 26| |  2| Maria| 30| |  1|Andrew| 25| +---+------+---+ 

Now in case you want to drop the duplicates considering ONLY id and name you'd have to run a select() prior to distinct(). For example,

>>> df.select(['id', 'name']).distinct().show() +---+------+ | id|  name| +---+------+ |  2| Maria| |  1|Andrew| +---+------+ 

But in case you wanted to drop the duplicates only over a subset of columns like above but keep ALL the columns, then distinct() is not your friend.


dropDuplicates() will drop the duplicates detected over the provided set of columns, but it will also return all the columns appearing in the original dataframe.

df.dropDuplicates().show()  +---+------+---+ | id|  name|age| +---+------+---+ |  1|Andrew| 26| |  2| Maria| 30| |  1|Andrew| 25| +---+------+---+ 

dropDuplicates() is thus more suitable when you want to drop duplicates over a selected subset of columns, but also want to keep all the columns:

df.dropDuplicates(['id', 'name']).show()  +---+------+---+ | id|  name|age| +---+------+---+ |  2| Maria| 30| |  1|Andrew| 25| +---+------+---+ 

For more details refer to the article distinct() vs dropDuplicates() in Python

like image 35
Giorgos Myrianthous Avatar answered Sep 18 '22 10:09

Giorgos Myrianthous


From javadoc, there is no difference between distinc() and dropDuplicates().

dropDuplicates

public DataFrame dropDuplicates()

Returns a new DataFrame that contains only the unique rows from this DataFrame. This is an alias for distinct.

dropDuplicates() was introduced in 1.4 as a replacement for distinct(), as you can use it's overloaded methods to get unique rows based on subset of columns.

like image 42
Mrinal Avatar answered Sep 19 '22 10:09

Mrinal