Assuming I have the following DataFrame:
A | B
1 | Ms
1 | PhD
2 | Ms
2 | Bs
I want to remove duplicate rows with respect to column A, keeping the row whose value in column B is 'PhD'. If a group has no 'PhD' row, I want to keep the row with 'Bs' in column B instead.
I am trying to use
df.drop_duplicates('A')
with a condition
To remove all duplicate rows from a pandas DataFrame, including the first occurrence, set keep=False in drop_duplicates(), for example df.drop_duplicates(keep=False).
By default, this method returns a new DataFrame with duplicate rows removed. We can set the argument inplace=True to remove duplicates from the original DataFrame.
With the default keep='first', the first occurrence of each duplicate is kept and the rest are dropped.
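As a rough illustration of the keep argument, here is a small sketch applied to the DataFrame from the question. The variable names (kept_first, kept_last, kept_none) are only illustrative. Note that none of these options looks at the values in column B; they only depend on row order.

```python
import pandas as pd

# The DataFrame from the question
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": ["Ms", "PhD", "Ms", "Bs"]})

# keep='first' (the default): keep the first row seen for each value of A
kept_first = df.drop_duplicates("A")

# keep='last': keep the last row seen for each value of A
kept_last = df.drop_duplicates("A", keep="last")

# keep=False: drop every row whose A value occurs more than once
kept_none = df.drop_duplicates("A", keep=False)

print(kept_first)
print(kept_last)
print(kept_none)
```

Since every value of A is duplicated in this frame, keep=False leaves no rows at all, which is why row order has to be arranged first if a particular row should survive.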
Consider using Categoricals. They're a nice way to group / order text non-alphabetically (among other things).
import pandas as pd
df = pd.DataFrame([(1,'Ms'), (1, 'PhD'), (2, 'Ms'), (2, 'Bs'), (3, 'PhD'), (3, 'Bs'), (4, 'Ms'), (4, 'PhD'), (4, 'Bs')], columns=['A', 'B'])
df['B']=df['B'].astype('category')
# after setting the column's type to 'category', you can set the order
df['B']=df['B'].cat.set_categories(['PhD', 'Bs', 'Ms'], ordered=True)
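# sort by A, then by the category order of B (PhD first, then Bs, then Ms)
df.sort_values(['A', 'B'], inplace=True)
# the first row of each A group is now the preferred one, so the default keep='first' retains it
df_unique = df.drop_duplicates('A')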
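If converting the column to a categorical is not desirable, a similar result can probably be had with an explicit ranking. The sketch below is an alternative, not the approach above: the priority dictionary and the names rank and best_rows are assumptions made for illustration, and it assumes every value in column B appears in the dictionary.

```python
import pandas as pd

# The DataFrame from the question
df = pd.DataFrame({"A": [1, 1, 2, 2], "B": ["Ms", "PhD", "Ms", "Bs"]})

# Hypothetical ranking: a lower number means a higher priority to keep
priority = {"PhD": 0, "Bs": 1, "Ms": 2}

# Translate each B value into its rank (unmapped values would become NaN)
rank = df["B"].map(priority)

# For each value of A, find the index label of the best-ranked row
best_rows = rank.groupby(df["A"]).idxmin()

# Select those rows and renumber the index
df_unique = df.loc[best_rows].reset_index(drop=True)

print(df_unique)
```

This leaves the dtype of column B untouched and does not reorder the original frame, at the cost of maintaining the ranking separately from the data.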