
Removing duplicates from a Pandas DataFrame with a condition for retaining the original

Assuming I have the following DataFrame:

 A | B
 1 | Ms
 1 | PhD
 2 | Ms
 2 | Bs

I want to remove the duplicate rows with respect to column A, retaining the row with 'PhD' in column B as the original. If there is no 'PhD' row, I want to retain the row with 'Bs' in column B.

I am trying to use

 df.drop_duplicates('A') 

with a condition

Rakesh Adhikesavan asked Oct 09 '15 16:10





1 Answer

Consider using Categoricals. They're a nice way to group / order text non-alphabetically (among other things).

import pandas as pd
df = pd.DataFrame([(1, 'Ms'), (1, 'PhD'), (2, 'Ms'), (2, 'Bs'), (3, 'PhD'), (3, 'Bs'), (4, 'Ms'), (4, 'PhD'), (4, 'Bs')], columns=['A', 'B'])
df['B'] = df['B'].astype('category')
# after setting the column's type to 'category', you can set the order
df['B'] = df['B'].cat.set_categories(['PhD', 'Bs', 'Ms'], ordered=True)
# DataFrame.sort was removed in later pandas versions; sort_values replaces it
df.sort_values(['A', 'B'], inplace=True)
# drop_duplicates keeps the first row per 'A', which is now the highest-priority one
df_unique = df.drop_duplicates('A')
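
The same idea can be wrapped in a small helper and checked against the DataFrame from the question. This is only a sketch: the function name `keep_preferred` and its parameters are made up for illustration, and it assumes the priority list covers every value in column B (anything outside the list becomes NaN and sorts last).

```python
import pandas as pd

def keep_preferred(df, key="A", col="B", priority=("PhD", "Bs", "Ms")):
    """Keep one row per `key`, preferring earlier entries of `priority` in `col`.

    Hypothetical helper, not part of pandas.
    """
    ranked = df.copy()
    # An ordered categorical sorts by the given priority instead of alphabetically.
    # Values missing from `priority` become NaN and are sorted last.
    ranked[col] = pd.Categorical(ranked[col], categories=list(priority), ordered=True)
    ranked = ranked.sort_values([key, col])
    # drop_duplicates keeps the first row per key, which is now the preferred one.
    result = ranked.drop_duplicates(key)
    # Convert back to plain objects so the result does not carry the categorical dtype.
    return result.astype({col: object}).reset_index(drop=True)

# The DataFrame from the question.
df = pd.DataFrame(
    [(1, "Ms"), (1, "PhD"), (2, "Ms"), (2, "Bs")],
    columns=["A", "B"],
)
df_unique = keep_preferred(df)
print(df_unique)
```

For the question's data this keeps the 'PhD' row for A = 1 and the 'Bs' row for A = 2. Working on a copy leaves the original DataFrame untouched, unlike the inplace sort.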
mattvivier answered Oct 15 '22 16:10
