
How to remove consecutive identical words from a string in python

Tags: python

I have a string, shown below, from which I need to remove consecutive identical words.

mystring = "my friend's new new new new and old old cats are running running in the street"

My output should look as follows.

myoutput = "my friend's new and old cats are running in the street"

I am using the following python code to do it.

 mylist = []
 for i, w in enumerate(mystring.split()):
     for n, l in enumerate(mystring.split()):
         if l != w and i == n-1:
             mylist.append(w)
 mylist.append(mystring.split()[-1])
 myoutput = " ".join(mylist)

However, my code is O(n²), which is too slow for my huge dataset. Is there a more efficient way of doing this in Python?

I am happy to provide more details if needed.

EmJ asked Jul 27 '19 10:07



2 Answers

Short regex magic:

import re

mystring = "my friend's new new new new and old old cats are running running in the street"
res = re.sub(r'\b(\w+\s*)\1{1,}', '\\1', mystring)
print(res)

regex pattern details:

  • \b - word boundary
  • (\w+\s*) - one or more word chars \w+ followed by any number of whitespace characters \s* - enclosed into a captured group (...)
  • \1{1,} - a backreference to the 1st captured group, repeated one or more times {1,}

The output:

my friend's new and old cats are running in the street
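
The pattern above works for the sample string, but because \s* may match zero whitespace characters, it can also collapse doubled letters at the start of a word ("oops" becomes "ops"), and it misses a repeated pair at the very end of the string ("the the" stays unchanged, since the last word has no trailing space). A minimal sketch of a stricter variant follows. It is not the answerer's code: the helper name collapse_repeats is hypothetical, and the pattern assumes words are runs of \w characters separated by whitespace, so repeated words containing apostrophes (such as "friend's friend's") are still not collapsed.

```python
import re

# Hypothetical helper name; not part of the original answer.
def collapse_repeats(text):
    # (\w+) captures one whole word. (?:\s+\1\b)+ then requires one or more
    # whitespace-separated repeats of that same word, each ending on a word
    # boundary, so partial matches such as "new newer" are left alone.
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)

mystring = "my friend's new new new new and old old cats are running running in the street"
print(collapse_repeats(mystring))
print(collapse_repeats("oops the the"))
```

Because the whitespace lives outside the captured group, the replacement keeps only the first occurrence of the word and leaves the spacing after the last repeat untouched.
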
RomanPerekhrest answered Oct 07 '22 01:10


Using itertools.groupby:

>>> import itertools
>>> ' '.join(k for k, _ in itertools.groupby(mystring.split()))
"my friend's new and old cats are running in the street"

  • mystring.split() splits mystring into a list of words on whitespace.
  • itertools.groupby groups consecutive equal words; k is the key (the word itself) of each group.
  • Using a generator expression, we take only each group's key.
  • We join the keys with a single space.

The complexity is linear in the size of the input string.
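
To make the linear-time claim concrete, here is a rough sketch of the same idea written as an explicit single pass, plus a case-insensitive variant built on groupby's key argument. The function names dedupe_consecutive and dedupe_ignore_case are hypothetical and not from the answer. Like the one-liner, both split on whitespace and rejoin with single spaces, so any original spacing is not preserved.

```python
from itertools import groupby

# Hypothetical helper: one pass over the words, comparing each with the
# last word kept so far.
def dedupe_consecutive(text):
    words = []
    for word in text.split():
        if not words or word != words[-1]:
            words.append(word)
    return " ".join(words)

# Hypothetical variant: treat words differing only in case as identical,
# keeping the first spelling seen in each run.
def dedupe_ignore_case(text):
    return " ".join(next(group) for _, group in groupby(text.split(), key=str.lower))

mystring = "my friend's new new new new and old old cats are running running in the street"
print(dedupe_consecutive(mystring))
print(dedupe_ignore_case("The the cat"))
```

Each word is visited once and compared only with its predecessor, which is where the linear running time comes from.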

Ami Tavory answered Oct 07 '22 00:10