
How to get dropped rows when using drop_duplicates (Pandas DataFrame)?

I use pandas.DataFrame.drop_duplicates() to drop rows in which all column values are identical. For data quality analysis, however, I also need a DataFrame containing the duplicate rows that were dropped. How can I identify which rows will be dropped? One idea is to compare the original DataFrame with the deduplicated one and find the index labels that are missing, but is there a better way to do this?

Example:

import pandas as pd

data =[[1,'A'],[2,'B'],[3,'C'],[1,'A'],[1,'A']]

df = pd.DataFrame(data,columns=['Numbers','Letters'])

df.drop_duplicates(keep='first',inplace=True) # This will drop rows 3 and 4

# Now, how do I create a DataFrame containing only the duplicate rows that were dropped?
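The comparison idea described in the question can be sketched roughly as follows. This is only an illustration, assuming the index labels are unique and that the deduplication is done without inplace=True so the original frame is still available; the names original, deduped, dropped_labels and dropped_rows are made up for this example.

```python
import pandas as pd

data = [[1, 'A'], [2, 'B'], [3, 'C'], [1, 'A'], [1, 'A']]
original = pd.DataFrame(data, columns=['Numbers', 'Letters'])

# Deduplicate into a new frame; the original is left untouched.
deduped = original.drop_duplicates(keep='first')

# Index labels present in the original but missing after deduplication.
# Assumption: index labels are unique, otherwise this comparison is ambiguous.
dropped_labels = original.index.difference(deduped.index)
dropped_rows = original.loc[dropped_labels]

print(dropped_rows)
```

This works for the sample data, but it needs two copies of the frame and depends on the index, which is why a more direct approach is worth asking for.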

Code Ninja 2C4U Avatar asked Dec 11 '25 08:12



1 Answer

import pandas as pd

data =[[1,'A'],[2,'B'],[3,'C'],[1,'A'],[1,'A']]

df = pd.DataFrame(data,columns=['Numbers','Letters'])


df.drop_duplicates()  # rows that are kept (keep='first' is the default)

Output

    Numbers Letters
0   1       A
1   2       B
2   3       C

and

df.loc[df.duplicated()]  # rows that drop_duplicates() removes (duplicated() also defaults to keep='first')

Output

    Numbers Letters
3   1       A
4   1       A
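
Putting the two calls above together, a minimal sketch might look like the following. It assumes the same keep argument is passed to duplicated() as would be passed to drop_duplicates(); the variable names mask, dropped, kept and all_copies are illustrative only.

```python
import pandas as pd

data = [[1, 'A'], [2, 'B'], [3, 'C'], [1, 'A'], [1, 'A']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])

# Boolean Series: True for every row that drop_duplicates(keep='first') would remove.
mask = df.duplicated(keep='first')

dropped = df[mask]   # the duplicate rows that get dropped
kept = df[~mask]     # the rows that survive

# Sanity check: the surviving rows match what drop_duplicates() returns.
assert kept.equals(df.drop_duplicates(keep='first'))

# keep=False marks every member of a duplicate group, including the first occurrence.
all_copies = df[df.duplicated(keep=False)]

print(dropped)
print(all_copies)
```

Because the mask is computed before anything is removed, this avoids comparing indexes afterwards and does not depend on the index being unique. If drop_duplicates() is called with a subset argument, the same subset would need to be passed to duplicated().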
Chris Avatar answered Dec 12 '25 21:12
