I have a string, as follows, from which I need to remove consecutive duplicate words.
mystring = "my friend's new new new new and old old cats are running running in the street"
My output should look as follows.
myoutput = "my friend's new and old cats are running in the street"
I am using the following Python code to do it.
mylist = []
for i, w in enumerate(mystring.split()):
    for n, l in enumerate(mystring.split()):
        if l != w and i == n-1:
            mylist.append(w)
mylist.append(mystring.split()[-1])
myoutput = " ".join(mylist)
However, my code is O(n²) and really inefficient as I have a huge dataset. I am wondering if there is a more efficient way of doing this in Python.
I am happy to provide more details if needed.
You can remove duplicates from a Python list with dict.fromkeys(), which builds a dictionary whose keys are the unique items, or by converting the list to a set; either way you convert the dictionary or set back into a list to get the de-duplicated result. Note that this removes all repeated words, not just consecutive ones, so it is not a general answer to this question.
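For illustration, a minimal sketch of the dict.fromkeys() approach (it keeps the first occurrence of every word, so it only matches the expected output here because no word repeats non-consecutively):

mystring = "my friend's new new new new and old old cats are running running in the street"
# dict.fromkeys() preserves insertion order, so the first occurrence of each word survives
deduped = " ".join(dict.fromkeys(mystring.split()))
print(deduped)  # my friend's new and old cats are running in the street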
Short regex magic:
import re
mystring = "my friend's new new new new and old old cats are running running in the street"
res = re.sub(r'\b(\w+\s*)\1{1,}', '\\1', mystring)
print(res)
regex pattern details:
- \b - word boundary
- (\w+\s*) - one or more word chars \w+ followed by any number of whitespace characters \s*, enclosed in a capture group (...)
- \1{1,} - the 1st captured group repeated one or more times {1,}
The output:
my friend's new and old cats are running in the street
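One caveat, not mentioned in the original answer: because the whitespace is captured inside the group, a repeated word at the very end of the string (with no trailing space) is not collapsed. A slightly different pattern avoids that; this is a sketch assuming plain whitespace-separated words:

import re

mystring = "old old cats are running in the street street"
# require whitespace between repetitions and a word boundary after the run
res = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', mystring)
print(res)  # old cats are running in the street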
Using itertools.groupby:

>>> import itertools
>>> ' '.join(k for k, _ in itertools.groupby(mystring.split()))
"my friend's new and old cats are running in the street"

mystring.split() splits mystring into a list of words. itertools.groupby then groups consecutive equal words, so keeping only the key k of each group drops the repeats. The complexity is linear in the size of the input string.
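Since the question mentions a huge dataset, here is a minimal sketch of wrapping that one-liner in a reusable function (the function name and the sample sentences are illustrative, not from the answer):

from itertools import groupby

def collapse_repeats(text):
    # keep one word from each run of consecutive identical words
    return " ".join(key for key, _ in groupby(text.split()))

sentences = [
    "my friend's new new new new and old old cats are running running in the street",
    "it was was a very very long day",
]
for s in sentences:
    print(collapse_repeats(s))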