Is there a way to remove duplicate and continuous words/phrases in a string?

Is there a way to remove duplicate and continuous words/phrases in a string? E.g.

[in]: foo foo bar bar foo bar

[out]: foo bar foo bar

I have tried this:

>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'

What if it gets a little more complicated and I want to remove repeated phrases (let's say a phrase can be up to 5 words long)? How can that be done? E.g.

[in]: foo bar foo bar foo bar

[out]: foo bar

Another example:

[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .

[out]: this is a sentence where phrases duplicate . sentence are not prhases .

asked Feb 27 '14 10:02

2 Answers

You can use the re module for that.

>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'

If you want to match any number of consecutive occurrences:

>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'
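If the pattern looks dense, it can be spelled out with re.VERBOSE. This is only an annotated sketch of the same expression; the names REPEATED_PHRASE and collapse_once are made up for illustration and are not part of the answer itself.

```python
import re

# Same pattern as above, annotated. REPEATED_PHRASE and collapse_once
# are hypothetical names chosen for this sketch.
REPEATED_PHRASE = re.compile(
    r"""
    \b(.+)       # group 1: a candidate phrase that starts at a word boundary
    (\s+\1\b)+   # one or more repeats of group 1, each preceded by whitespace
    """,
    re.VERBOSE,
)


def collapse_once(text):
    """Make a single left-to-right pass, keeping one copy of each repeated run."""
    return REPEATED_PHRASE.sub(r"\1", text)


print(collapse_once("foo foo bar bar"))          # foo bar
print(collapse_once("foo bar foo bar foo bar"))  # foo bar
```

Because .+ is greedy, the longest phrase that repeats at a given starting position is tried first.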

Edit: to handle your last example, you have to keep calling re.sub for as long as the string still contains duplicate phrases, because removing one repeat can expose another. So:

>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...   s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
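The loop above scans the string twice per round (once in re.search, once in re.sub). One possible variation, sketched here under the assumption that the same pattern is wanted, uses re.subn, which also returns the number of replacements made. The names PATTERN and collapse_until_stable are hypothetical.

```python
import re

# Hypothetical names; the pattern itself is the one used above.
PATTERN = re.compile(r"\b(.+)(\s+\1\b)+")


def collapse_until_stable(text):
    """Repeat the substitution until a pass replaces nothing."""
    while True:
        text, replaced = PATTERN.subn(r"\1", text)
        if replaced == 0:
            return text


s = ("this is a sentence sentence sentence this is a sentence "
     "where phrases phrases duplicate where phrases duplicate")
print(collapse_until_stable(s))  # this is a sentence where phrases duplicate
```

Every replacement shortens the string, so the loop always terminates. Note that this pattern does not cap the phrase length at 5 words.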

answered Nov 03 '22 01:11

I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes an iterable and groups runs of repeated, adjacent items into (item_value, iterator_of_those_values) tuples. Use it here like:

>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
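To see what groupby is doing, it can help to look at the runs it produces. A small sketch using the first example from the question:

```python
from itertools import groupby

words = "foo foo bar bar foo bar".split()

# Each group is a run of equal, adjacent items; count the run lengths.
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)  # [('foo', 2), ('bar', 2), ('foo', 1), ('bar', 1)]

# Keeping only the keys collapses every run to a single word.
print(" ".join(key for key, _ in groupby(words)))  # foo bar foo bar
```

Only adjacent duplicates are merged, which is why the later 'foo' and 'bar' survive.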

So let's extend that a little with a function that takes a sequence of chunks (lists of words), drops consecutive duplicate chunks, and flattens what is left back into a single list of words:

from itertools import chain, groupby

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))
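Note that dedupe flattens its result with chain, so it is meant to receive chunks (lists of words) rather than bare words. A small sketch with made-up input:

```python
from itertools import chain, groupby


def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))


# Adjacent equal chunks are merged, then everything is flattened.
chunks = [["foo", "bar"], ["foo", "bar"], ["baz"]]
print(dedupe(chunks))  # ['foo', 'bar', 'baz']

# Passing bare strings instead of lists splits them into characters.
print(dedupe(["foo", "foo"]))  # ['f', 'o', 'o']
```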

That's great for one-word phrases, but not helpful for longer phrases. What to do? Well, first, we'll want to check for longer phrases by striding over our original list of words in fixed-size chunks:

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return
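A quick illustration of what stride yields, using a made-up list of letters in place of words. With a non-zero offset, the leading items come out as a shorter first chunk:

```python
def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return


words = "a b c d e f g".split()
print(list(stride(words, 0, 3)))  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
print(list(stride(words, 1, 3)))  # [['a'], ['b', 'c', 'd'], ['e', 'f', 'g']]
```

Trying every offset from 0 to length - 1 covers every possible alignment of a repeated phrase of that length.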

Now we're cooking! OK. So our strategy here is to first remove all the single-word duplicates. Next, we'll remove the two-word duplicates, starting from offset 0 then 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on until we've hit five-word duplicates:

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

Putting it all together:

from itertools import chain, groupby

def stride(lst, offset, length):
    if offset:
        yield lst[:offset]

    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'

b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)  # prints True
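For comparison, here is a compact Python 3 restatement of the same strategy in a single function. It is a sketch, not a replacement for the code above: the name collapse_repeats and the default of 5 words are assumptions made for illustration.

```python
from itertools import groupby


def collapse_repeats(words, max_phrase_length=5):
    """Drop adjacent repeated phrases of 1..max_phrase_length words.

    Hypothetical helper: same chunk-and-groupby idea as cleanse above.
    """
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            # Leading partial chunk, then fixed-size chunks from `offset`.
            chunks = [words[:offset]] if offset else []
            chunks += [words[i:i + length]
                       for i in range(offset, len(words), length)]
            # Keep one copy of each run of equal chunks, then flatten.
            words = [w for chunk, _ in groupby(chunks) for w in chunk]
    return words


print(" ".join(collapse_repeats("foo bar foo bar foo bar".split())))     # foo bar
print(" ".join(collapse_repeats("foo foo bar bar foo bar".split(), 1)))  # foo bar foo bar
```

With max_phrase_length=1 only single-word repeats are removed, which reproduces the first example in the question. With the default, 'foo bar foo bar' is itself treated as a repeated two-word phrase and collapses further.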

answered Nov 03 '22 01:11

Kirk Strauser

Tags: python, string, regex

alvas

sharcashmo

Kirk Strauser
