Python look-behind regex "fixed-width pattern" error while looking for consecutive repeated words

Question

I have a text with words separated by ., with instances of 2 and 3 consecutive repeated words:

My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die-

I need to match the duplicates and the triplicates separately with a regex, so that a match for a duplicate never comes from inside a triplicate.

Since there are at most 3 consecutive repeated words, this

r'\b(\w+)\.+\1\.+\1\b'

successfully catches

father.father.father

However, in order to catch 2 consecutive repeated words, I need to make sure the next and previous words aren't the same. I can do a negative look-ahead

r'\b(\w+)\.+\1(?!\.+\1)\b'

but my attempts at the negative look-behind

r'(?<!(\w)\.)\b\1\.+\1\b(?!\.\1)'

either raise a fixed-width error (when I keep the +) or fail in some other way.

How should I correct the negative look-behind?
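
For reference, here is a minimal sketch that reproduces the fixed-width complaint. The pattern is a simplified stand-in chosen for illustration, not one of my exact attempts; the only point is that a quantifier such as + inside a look-behind is rejected when the pattern is compiled.

```python
import re

# A look-behind containing a quantifier has no fixed width,
# so Python's re module refuses to compile it.
try:
    re.compile(r'(?<!\w+\.)\bname\b')
    message = None  # no error was raised on this Python version
except re.error as exc:
    message = str(exc)

print(message)
```

The error is raised by re.compile itself, before any text is searched, so it does not depend on the input string.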

joaoricardo000 · Accepted Answer

I think that there might be an easier way to capture what you want without the negative look-behind:

import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"
r = re.compile(r'\b((\w+)\.+\2\.+\2?)\b')
r.findall(t)

> [('name.name.', 'name'), ('father.father.father', 'father')]

This just makes the third repetition optional.
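
One thing to watch for: the dots before the optional third word are still required, so the two-word match keeps its trailing dot ('name.name.'). If that matters, a possible variant (a tweak of the pattern above, not a different technique) makes the separator and the third word optional together:

```python
import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# The separator and the third repetition are optional as a unit,
# so a two-word match does not swallow the dot that follows it.
r = re.compile(r'\b((\w+)\.+\2(?:\.+\2)?)\b')
result = r.findall(t)
print(result)  # [('name.name', 'name'), ('father.father.father', 'father')]
```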

A version that captures any number of repetitions of the same word could look something like this:

r = re.compile(r'\b((\w+)(\.+\2)\3*)\b')
r.findall(t)
> [('name.name', 'name', '.name'), ('father.father.father', 'father', '.father')]
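
To get back to the original goal of keeping duplicates and triplicates apart, one option is to match every run and then sort the runs by how many words they contain. This is only a sketch: the names run, doubles and triples are made up for the example, and the pattern is a slightly simplified form of the one above with a non-capturing group.

```python
import re

t = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# One match per run of a repeated word; group 1 is the word itself.
run = re.compile(r'\b(\w+)(?:\.+\1)+\b')

doubles, triples = [], []
for m in run.finditer(t):
    count = len(re.findall(r'\w+', m.group(0)))  # number of words in this run
    if count == 2:
        doubles.append(m.group(0))
    elif count == 3:
        triples.append(m.group(0))

print(doubles)  # ['name.name']
print(triples)  # ['father.father.father']
```

Since each run is consumed as a single match, the last two words of a triplicate can never show up as a duplicate, which is what the negative look-behind was meant to prevent.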

Jean-François Fabre · Answer

Maybe regexes are not needed at all.

Using itertools.groupby does the job. It's designed to group equal occurrences of consecutive items.

- group by words (after splitting on dots)
- convert each group to a list and emit a (value, count) tuple only if its length is > 1

like this:

import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

matches = [
    (run[0], len(run))
    for run in (list(group) for _, group in itertools.groupby(s.split(".")))
    if len(run) > 1
]

result:

[('name', 2), ('father', 3)]

So we can do whatever we want with this list of tuples, for instance filter it on the number of occurrences.
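
As a sketch of that filtering step, the duplicates and triplicates the question asks about can be pulled apart by run length. The names runs, doubles and triples are just illustrative:

```python
import itertools

s = "My.name.name.is.Inigo.Montoya.You.killed.my.father.father.father.Prepare.to.die"

# (word, run length) for every run of consecutive equal words
runs = [(word, len(list(group))) for word, group in itertools.groupby(s.split("."))]

doubles = [word for word, count in runs if count == 2]
triples = [word for word, count in runs if count == 3]

print(doubles)  # ['name']
print(triples)  # ['father']
```

This assumes the words are separated by exactly one dot, as in the sample string.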

Bonus (I misread the question at first, so I'm leaving it in): to remove the duplicates from the sentence:

- group by words (after splitting on dots), as above
- keep only the key of each group in a list comprehension (we don't need the group values since we aren't counting)
- join back with dots

In one line (still using itertools):

new_s = ".".join([k for k,_ in itertools.groupby(s.split("."))])

result:

My.name.is.Inigo.Montoya.You.killed.my.father.Prepare.to.die
