
Drop or replace values within duplicate rows in pandas dataframe

I have a data frame df where some rows are duplicates with respect to a subset of columns:

A    B     C
1    Blue  Green
2    Red   Green
3    Red   Green
4    Blue  Orange
5    Blue  Orange

I would like to remove (or replace with a dummy string) the B and C values on duplicate rows, without deleting the rows themselves, ideally producing:

A    B     C
1    Blue  Green
2    Red   Green
3    NaN   NaN
4    Blue  Orange
5    NaN   NaN

As per this thread: Replace duplicate values across columns in Pandas, I've tried using pd.Series.duplicated; however, I can't get it to work with duplicates in a subset of columns.

I've also played around with:

is_duplicate = df.duplicated(subset=['B','C'])
df = df.where(is_duplicate, 999)  # 999 intended as a placeholder that I could find-and-replace later on

However, this replaces almost every value in the frame with 999, so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!

Lyam Avatar asked Jan 02 '23 00:01

Lyam


1 Answer

df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan (with numpy imported as np) seems to work for me. df.duplicated(subset=['B','C']) marks every row after the first occurrence of each (B, C) pair, and the .loc assignment blanks out only those two columns on the marked rows.

Edited to include @ALollz and @macaw_9227 correction.
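A runnable sketch of the one-liner above, rebuilding the asker's frame from the question (column names A, B, C as in the question; everything else is standard pandas):

```python
import numpy as np
import pandas as pd

# Reproduce the frame from the question.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue'],
    'C': ['Green', 'Green', 'Green', 'Orange', 'Orange'],
})

# duplicated(keep='first', the default) flags every repeat of a (B, C)
# pair after its first occurrence; .loc then overwrites only columns
# B and C on those flagged rows, leaving A and the row itself intact.
mask = df.duplicated(subset=['B', 'C'])
df.loc[mask, ['B', 'C']] = np.nan
print(df)
#    A     B       C
# 0  1  Blue   Green
# 1  2   Red   Green
# 2  3   NaN     NaN
# 3  4  Blue  Orange
# 4  5   NaN     NaN
```

This also shows why the attempt in the question misfired: df.where(mask, 999) keeps values where the mask is True (the duplicates) and replaces everything else, which is the inverse of what was wanted.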

Tom Avatar answered Jan 05 '23 00:01

Tom