I have a data frame df
where some rows are duplicates with respect to a subset of columns:
A  B     C
1  Blue  Green
2  Red   Green
3  Red   Green
4  Blue  Orange
5  Blue  Orange
I would like to remove (or replace with a dummy string) the values in rows that are duplicates with respect to B and C, without deleting the whole row, ideally producing:
A  B     C
1  Blue  Green
2  Red   Green
3  NaN   NaN
4  Blue  Orange
5  NaN   NaN
As per this thread: Replace duplicate values across columns in Pandas, I've tried using pd.Series.duplicated, but I can't get it to work with duplicates in a subset of columns.
I've also played around with:
is_duplicate = df.loc[df.duplicated(subset=['B','C'])]
df = df.where(is_duplicate == True, 999)  # 999 intended as a placeholder that I could find-and-replace later on
However, this replaces almost every value in every column with 999, so clearly I'm doing something wrong. I'd appreciate any advice on how to proceed!
df.loc[df.duplicated(subset=['B','C']), ['B','C']] = np.nan

(with numpy imported as np) seems to work for me. The problem with your df.where attempt is that where keeps values where the condition is True and replaces everywhere else, so a mask that selects only the duplicate rows ends up replacing nearly everything.
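As a self-contained sketch, here is that one-liner applied to the example frame from the question (rebuilt here for the demo):

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': ['Blue', 'Red', 'Red', 'Blue', 'Blue'],
    'C': ['Green', 'Green', 'Green', 'Orange', 'Orange'],
})

# duplicated() marks every repeat after the first occurrence
# (keep='first' is the default), so rows 3 and 5 are flagged here.
mask = df.duplicated(subset=['B', 'C'])

# Assign NaN only to columns B and C of the flagged rows; column A is untouched.
df.loc[mask, ['B', 'C']] = np.nan
print(df)
```

If you want every member of a duplicate group blanked (including the first occurrence), pass keep=False to duplicated instead.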
Edited to include @ALollz and @macaw_9227 correction.