
Check for duplicates in a PySpark DataFrame

Is there a simple and efficient way to check a PySpark DataFrame for duplicates (without dropping them) based on one or more columns?

I want to check whether a DataFrame has duplicates based on a combination of columns and, if it does, fail the process.

Thanks in advance.

Prasanna Saraswathi Krishnan asked May 01 '18 19:05





2 Answers

The easiest way would be to check if the number of rows in the dataframe equals the number of rows after dropping duplicates.

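# listOfColumns is already a list of column names, e.g. ['column1', 'column2'],
# so pass it directly rather than wrapping it in another list.
if df.count() > df.dropDuplicates(listOfColumns).count():
    raise ValueError('Data has duplicates')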
David answered Sep 27 '22 20:09



If you also want to inspect the duplicates themselves, you can group by the key columns and show the combinations that occur more than once:

df \
    .groupby(['column1', 'column2']) \
    .count() \
    .where('count > 1') \
    .sort('count', ascending=False) \
    .show()
Konstantin answered Sep 27 '22 22:09

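The two answers can be combined into one reusable check that fails the process and also reports one offending key combination. What follows is only a minimal sketch: the function name `assert_no_duplicates` and the parameter `key_columns` are hypothetical, chosen for illustration. It relies only on the `groupBy`, `count`, `where`, and `head` methods of the DataFrame passed in, so it needs no imports of its own.

```python
def assert_no_duplicates(df, key_columns):
    """Raise ValueError if any combination of key_columns occurs more than once.

    `df` is assumed to be a PySpark DataFrame and `key_columns` a list of
    column names; both names are illustrative, not from the original answers.
    """
    # Group on the key columns and keep only the groups with more than one row.
    duplicate_groups = (
        df.groupBy(*key_columns)
        .count()
        .where("count > 1")
    )
    # head(1) returns a list holding at most one row; an empty list means
    # that no key combination is repeated.
    first = duplicate_groups.head(1)
    if first:
        raise ValueError(
            f"Data has duplicates on {list(key_columns)}, for example: {first[0]}"
        )
```

Calling `assert_no_duplicates(df, ['column1', 'column2'])` before the downstream step would then stop the job with a `ValueError` when a key combination repeats. Compared with comparing two `count()` results, this version probably trades one of the counts for a grouped aggregation, and the error message shows a sample duplicate; which one is faster on a given dataset is something to measure rather than assume.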