Below is a column with the data I have and another column with the de-duplicated data I want.
I honestly don't even know how to start doing this in Python code. I've read a couple of posts on this in R, but not in Python.
If you're looking to get rid of consecutive duplicates only, this should suffice:
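# regex=True is needed on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)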
df
           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
Details
\b        # word boundary
(\w+)     # 1st capture group of a single word
(
  \s+     # 1 or more whitespace characters
  \1      # reference to first group
)+        # one or more repeats
\b        # word boundary
Regex from here.
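If you want to experiment with the pattern outside of pandas, here is a minimal sketch using the standard `re` module. It compiles the commented breakdown above with `re.VERBOSE`, so the annotated form is the pattern itself. The names `CONSECUTIVE_DUPES` and `collapse_consecutive` are made up for illustration.

```python
import re

# Same pattern as above; re.VERBOSE ignores the whitespace and allows comments.
CONSECUTIVE_DUPES = re.compile(r"""
    \b          # word boundary
    (\w+)       # group 1: a single word
    (
        \s+     # 1 or more whitespace characters
        \1      # the same word again (backreference to group 1)
    )+          # one or more repeats
    \b          # word boundary
""", re.VERBOSE)


def collapse_consecutive(text):
    """Collapse a run of the same word into a single occurrence."""
    return CONSECUTIVE_DUPES.sub(r"\1", text)


for s in ["Racoon Dog", "Cat Cat", "Dog Dog Dog Dog", "Rat Fox Chicken"]:
    print(repr(s), "->", repr(collapse_consecutive(s)))
```

Note that a repeat separated by another word, such as "Dog Cat Dog", is left untouched by this pattern, since it only matches consecutive repeats.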
To remove non-consecutive duplicates, I'd suggest a solution involving the OrderedDict
data structure:
from collections import OrderedDict

# OrderedDict.fromkeys keeps only the first occurrence of each word, in original order
df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df
           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
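On Python 3.7 and later a plain `dict` also preserves insertion order, so the same idea should work without the `OrderedDict` import. The sketch below shows it on plain strings; the helper name `dedupe_words` is hypothetical, and it could be passed to `df['Current'].apply(...)` instead of the split/lambda/join chain.

```python
def dedupe_words(text):
    """Drop repeated words, keeping the first occurrence of each, in order."""
    # dict.fromkeys keeps keys in first-seen order (guaranteed since Python 3.7).
    return " ".join(dict.fromkeys(text.split()))


# Unlike the regex approach, this also removes non-consecutive repeats.
for s in ["Cat Cat", "Dog Cat Dog", "Rat Fox Chicken"]:
    print(repr(s), "->", repr(dedupe_words(s)))
```

The comparison is case-sensitive, so "Cat" and "cat" would be kept as two different words.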