Removing duplicates from a pandas DataFrame with a condition for retaining the original

Tags: python, pandas, dataframe

Assuming I have the following DataFrame:

 A | B
 1 | Ms
 1 | PhD
 2 | Ms
 2 | Bs

I want to remove the rows that are duplicates with respect to column A, keeping the row with 'PhD' in column B as the original. If there is no 'PhD' row for a given value of A, I want to keep the row with 'Bs' in column B instead.

I am trying to use

 df.drop_duplicates('A')

with a condition
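
For reference, here is a minimal sketch of what the plain call does on the sample data above (the variable names `df` and `naive` are only illustrative). By default `drop_duplicates` keeps the first row it encounters for each value of A, so the outcome depends on row order and not on the value in B:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': ['Ms', 'PhD', 'Ms', 'Bs']})

# keep='first' is the default: the first row per value of A survives
naive = df.drop_duplicates('A')
print(naive)
```

Here both surviving rows have 'Ms' in column B, which is not the desired result.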

552

asked Oct 09 '15 16:10

Rakesh Adhikesavan

1 Answer

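Consider using Categoricals. They're a nice way to group and order text non-alphabetically (among other things). Once column B has an explicit category order, sorting by A and B puts the preferred row first in each group, and drop_duplicates then keeps that row.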

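import pandas as pd
df = pd.DataFrame([(1, 'Ms'), (1, 'PhD'), (2, 'Ms'), (2, 'Bs'), (3, 'PhD'), (3, 'Bs'), (4, 'Ms'), (4, 'PhD'), (4, 'Bs')], columns=['A', 'B'])
df['B'] = df['B'].astype('category')
# after setting the column's type to 'category', you can set the order;
# categories listed first sort first, so 'PhD' is preferred over 'Bs' over 'Ms'
df['B'] = df['B'].cat.set_categories(['PhD', 'Bs', 'Ms'], ordered=True)
# df.sort() no longer exists in current pandas; sort_values() is its replacement
df = df.sort_values(['A', 'B'])
# drop_duplicates keeps the first row per value of A, i.e. the preferred one
df_unique = df.drop_duplicates('A')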
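
As an alternative sketch that leaves the dtype of B untouched, an explicit rank mapping can serve as the sort key. The `priority` dict and the temporary `prio` column are hypothetical names introduced here for illustration, and the order PhD, Bs, Ms is assumed from the question:

```python
import pandas as pd

df = pd.DataFrame(
    [(1, 'Ms'), (1, 'PhD'), (2, 'Ms'), (2, 'Bs'), (3, 'PhD'), (3, 'Bs'),
     (4, 'Ms'), (4, 'PhD'), (4, 'Bs')],
    columns=['A', 'B'],
)

# Lower number = higher priority when choosing which duplicate to keep
# (assumed order: PhD first, then Bs, then Ms)
priority = {'PhD': 0, 'Bs': 1, 'Ms': 2}

df_unique = (
    df.assign(prio=df['B'].map(priority))  # temporary helper column
      .sort_values(['A', 'prio'])          # preferred row first within each A
      .drop_duplicates('A')                # keep the first row per A
      .drop(columns='prio')
      .reset_index(drop=True)
)
print(df_unique)
```

On this data the retained rows are PhD for A=1, Bs for A=2, and PhD for A=3 and A=4. Values of B that are missing from `priority` map to NaN and sort last within their group.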

180

answered Oct 15 '22 16:10

mattvivier
