I use pandas.DataFrame.drop_duplicates() to drop rows whose column values are all identical, but for data-quality analysis I need a DataFrame containing the dropped duplicate rows. How can I identify which rows will be dropped? One idea is to compare the original DataFrame against the de-duplicated one and find the indexes that went missing, but is there a better way?
Example:
import pandas as pd
data = [[1, 'A'], [2, 'B'], [3, 'C'], [1, 'A'], [1, 'A']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])
df.drop_duplicates(keep='first', inplace=True)  # This drops the rows at index 3 and 4
# Now how do I create a DataFrame containing only the dropped duplicate rows?
Use df.duplicated() to select exactly the rows that drop_duplicates() removes. First, the de-duplicated frame:
import pandas as pd
data = [[1, 'A'], [2, 'B'], [3, 'C'], [1, 'A'], [1, 'A']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])
df.drop_duplicates()
Output
Numbers Letters
0 1 A
1 2 B
2 3 C
and the dropped duplicate rows:
df.loc[df.duplicated()]
Output
Numbers Letters
3 1 A
4 1 A
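One detail worth noting: the keep argument of duplicated() should mirror the one passed to drop_duplicates(), since both methods flag occurrences the same way. A minimal sketch using the example data above (keep='first' is the default for both; keep=False flags every member of a duplicate group):

```python
import pandas as pd

data = [[1, 'A'], [2, 'B'], [3, 'C'], [1, 'A'], [1, 'A']]
df = pd.DataFrame(data, columns=['Numbers', 'Letters'])

# The kept rows and the dropped rows are complementary,
# as long as the same `keep` argument is used for both calls.
kept = df.drop_duplicates(keep='first')
dropped = df.loc[df.duplicated(keep='first')]
print(dropped)  # rows at index 3 and 4

# With keep=False, *all* members of a duplicate group are flagged,
# which is often what you want for a data-quality report.
all_dupes = df.loc[df.duplicated(keep=False)]
print(all_dupes)  # rows at index 0, 3 and 4
```

Both slices keep the original index labels, so you can join them back to the source data or count duplicates per group without any extra bookkeeping.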