Is there a simple and efficient way to check a PySpark DataFrame for duplicates (without dropping them) based on one or more columns?
I want to check whether a DataFrame has duplicates based on a combination of columns and, if it does, fail the process.
TIA.
➠ Find complete row duplicates: use groupBy on all of the DataFrame's columns together with the count() aggregate, then filter for groups whose count is greater than 1.
➠ Find column-level duplicates: use groupBy on just the required columns together with count(), then filter to get the duplicate records.
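A minimal sketch of both patterns, assuming a DataFrame named df and illustrative key columns column1 and column2:

from pyspark.sql import functions as F

# Complete-row duplicates: group on every column and keep groups seen more than once
full_row_dups = (
    df.groupBy(df.columns)
      .count()
      .filter(F.col('count') > 1)
)

# Column-level duplicates: group only on the columns that should be unique
key_dups = (
    df.groupBy('column1', 'column2')
      .count()
      .filter(F.col('count') > 1)
)

full_row_dups.show()
key_dups.show()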
In PySpark, there are two ways to get the count of distinct values. You can call distinct() followed by count() on the DataFrame, or you can use the countDistinct() function from pyspark.sql.functions, which returns the distinct count over the selected columns. If the distinct count over the key columns is smaller than the total row count, the DataFrame contains duplicates.
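A sketch of that idea applied to the duplicate check, again assuming df and illustrative key columns column1 and column2:

from pyspark.sql import functions as F

total_rows = df.count()
# distinct() keeps NULL key combinations in the count
distinct_keys = df.select('column1', 'column2').distinct().count()

# Alternative aggregate form (note: countDistinct ignores rows where a key column is NULL)
distinct_keys_agg = df.agg(F.countDistinct('column1', 'column2')).first()[0]

if distinct_keys < total_rows:
    raise ValueError('Data has duplicates')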
The easiest way would be to check if the number of rows in the dataframe equals the number of rows after dropping duplicates.
if df.count() > df.dropDuplicates(listOfColumns).count():
    raise ValueError('Data has duplicates')
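For context, here is a self-contained sketch of that check with made-up sample data (the SparkSession setup and column names are purely illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data with one duplicated (id, country) combination
df = spark.createDataFrame(
    [(1, 'US'), (2, 'DE'), (2, 'DE'), (3, 'FR')],
    ['id', 'country'],
)

listOfColumns = ['id', 'country']
# Raises ValueError because (2, 'DE') appears twice
if df.count() > df.dropDuplicates(listOfColumns).count():
    raise ValueError('Data has duplicates')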
If you also want to actually inspect the duplicates, you can do
df \
    .groupby(['column1', 'column2']) \
    .count() \
    .where('count > 1') \
    .sort('count', ascending=False) \
    .show()
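If you need the full duplicated rows rather than just the key counts, one option (a sketch reusing the same illustrative column names) is to join the duplicated keys back to the original DataFrame:

dup_keys = (
    df.groupby(['column1', 'column2'])
      .count()
      .where('count > 1')
      .drop('count')
)

# Inner join keeps only the rows whose key combination appears more than once
df.join(dup_keys, on=['column1', 'column2'], how='inner').show()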