I have two columns with a lot of duplicated items per cell in a dataframe. Something similar to this:
Index x y
1 1 ec, us, us, gbr, lst
2 5 ec, us, us, us, us, ec, ec, ec, ec
3 8 ec, us, us, gbr, lst, lst, lst, lst, gbr
4 5 ec, ec, ec, us, us, ir, us, ec, ir, ec, ec
5 7 chn, chn, chn, ec, ec, us, us, gbr, lst
I need to eliminate all the duplicate items an get a resulting dataframe like this:
Index x y
1 1 ec, us, gbr, lst
2 5 ec, us
3 8 ec, us, gbr,lst
4 5 ec, us, ir
5 7 chn, ec, us, gbr, lst
Thanks!!
Split
and apply set
and join
i.e
df['y'].str.split(', ').apply(set).str.join(', ')
0 us, ec, gbr, lst
1 us, ec
2 us, ec, gbr, lst
3 us, ec, ir
4 us, lst, ec, gbr, chn
Name: y, dtype: object
Update based on comment :
df['y'].str.replace('nan|[{}\s]','', regex=True).str.split(',').apply(set).str.join(',').str.strip(',').str.replace(",{2,}",",", regex=True)
# Replace all the braces and nan with `''`, then split and apply set and join
Try this:
d['y'] = d['y'].apply(lambda x: ', '.join(sorted(set(x.split(', ')))))
If you don't care about item order, and assuming the data type of everything in column y
is a string, you can use the following snippet:
df['y'] = df['y'].apply(lambda s: ', '.join(set(s.split(', '))))
The set()
conversion is what removes duplicates. I think in later versions of python it might preserve order (3.4+ maybe?), but that is an implementation detail rather than a language specification.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With