
Python - error: look-behind requires fixed-width pattern

I have a string that looks like:

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'

I want to return a new string with certain words removed, only if they are not preceded by certain other words.

For example, the words I want to remove are:

c_out = ["avon", "powys", "somerset","hampshire"]

Only if they do not follow:

c_except = [r"on\s", r"dinas\s"]

Note: There could be multiple instances of words within c_out, and multiple instances of words within c_except.

Individually I tried for 'on\s':

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'

regexp1 = re.compile(r'(?<!on\s)(avon|powys|somerset|hampshire)')
print("1st Result: ", regexp1.sub('', phrase))
1st Result:  '5  road bradford on avon avon dinas   north'

This correctly keeps the first 'avon', as it is preceded by 'on\s', and correctly removes the third 'avon', but it fails to remove the second 'avon'.

In the same way, for 'dinas\s':

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'

regexp2 = re.compile(r'(?<!dinas\s)(avon|powys|somerset|hampshire)')
print("2nd Result: ", regexp2.sub('', phrase))
2nd Result:  '5  road bradford on   dinas powys  north '

This correctly keeps the first 'powys' and removes the second (note the double space between '... powys north').

I tried to combine the two expressions by doing:

regexp3 = re.compile(r'((?!on\s)|(?!dinas\s))(avon|powys|somerset|hampshire)')
print("3rd Result: ", regexp3.sub('', phrase))
3rd Result:  5  road bradford on   dinas   north

This incorrectly removed every word, and completely ignored 'on\s' or 'dinas\s'.

Then I tried:

regexp4 = re.compile(r'(?<!on\s|dinas\s)(avon|powys|somerset|hampshire)')
print("4th Result: ", regexp4.sub('', phrase))

And got:

error: look-behind requires fixed-width pattern

I want to end up with:

Result: '5  road bradford on avon dinas powys  north     '

I have had a look at:

Why is this not a fixed width pattern?
Python Regex Engine - "look-behind requires fixed-width pattern" Error
regex: string with optional parts

But to no avail.

What am I doing wrong?


From comments:

regexp5 = re.compile(r'(?<!on\s)(?<!dinas\s)(avon|powys|somerset|hampshire)')
print("5th Result: ", regexp5.sub('', phrase))
5th Result:  5  road bradford on avon avon dinas powys  north 

Again this misses the second avon.

Chuck asked Mar 09 '23 10:03


1 Answer

Here are 2 approaches that will solve the issue:

Chained Lookbehinds

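Convert the alternation-based lookbehind into several chained negative lookbehinds. The logic stays the same, since chained lookbehinds are combined with AND: a word is removed only if none of the exception patterns precedes it. Each lookbehind is fixed-width on its own, so the error goes away: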

import re
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
c_except = [r"on\s",r"dinas\s"]
c_out = ["avon", "powys", "somerset","hampshire"]
rx = r"(?<!\b{0})({1})".format(r")(?<!\b".join(c_except), "|".join(c_out))
print(re.sub(rx, "", phrase))

See this Python demo.
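To make the two moving parts visible, here is a minimal sketch (the variable names `compiled`, `without_b` and `with_b` are only illustrative). It checks that a single lookbehind with alternatives of different widths is rejected at compile time, and it compares the chained lookbehinds with and without `\b`. Without `\b`, the `on ` at the end of `avon ` also satisfies `on\s`, which appears to be why the second `avon` survived in the attempts from the question.

```python
import re

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
words = r"(avon|powys|somerset|hampshire)"

# "on\s" is 3 characters wide and "dinas\s" is 6, so putting both in one
# lookbehind gives a variable-width pattern, which re refuses to compile.
try:
    re.compile(r"(?<!on\s|dinas\s)" + words)
    compiled = True
except re.error:
    compiled = False

# Chained lookbehinds compile, but without \b the tail of "avon " still
# looks like "on ", so the second "avon" is kept by mistake.
without_b = re.sub(r"(?<!on\s)(?<!dinas\s)" + words, "", phrase)

# With \b only the whole words "on" and "dinas" protect the next word.
with_b = re.sub(r"(?<!\bon\s)(?<!\bdinas\s)" + words, "", phrase)

print(compiled)
print(without_b.split())
print(with_b.split())
```

Comparing the two word lists shows the only difference: the version with `\b` drops the second `avon`.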

Capturing Approach

Capture what you need to keep, merely match what you need to remove, and use the \1 backreference in the replacement to restore the Group 1 value:

import re
phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
c_except = [r"on\s+",r"dinas\s+"]
c_out = ["avon", "powys", "somerset","hampshire"]
rx = r"(\b(?:{0})(?:{1}))|(?:{1})".format(r"|".join(c_except), "|".join(c_out))
print(re.sub(rx, r"\1", phrase))

See another Python demo.

Note that this approach is preferable since you may use variable-width patterns inside c_except.

The regex will look like

(\b(?:on\s+|dinas\s+)(?:avon|powys|somerset|hampshire))|(?:avon|powys|somerset|hampshire)

The first alternative matches on or dinas as a whole word (due to the \b word boundary), followed by whitespace and any of the terms you need to remove. Since that alternative is wrapped in a capturing group, the \1 backreference in the replacement puts the matched text back unchanged. In all other contexts, the c_out terms are matched by the |(?:avon|powys|somerset|hampshire) alternative, where Group 1 is empty, so they are removed.
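As a rough illustration of the variable-width point, the sketch below uses a made-up input with two spaces after `on` (this input is an assumption and is not from the question). The capturing pattern with `\s+` still protects the following `avon`, while the fixed-width chained lookbehinds no longer recognise it.

```python
import re

# Hypothetical input: note the two spaces between "on" and "avon".
phrase = 'bradford on  avon avon dinas powys powys'

c_except = [r"on\s+", r"dinas\s+"]
c_out = ["avon", "powys", "somerset", "hampshire"]

# Same construction as in the capturing approach above.
rx = r"(\b(?:{0})(?:{1}))|(?:{1})".format("|".join(c_except), "|".join(c_out))
captured = re.sub(rx, r"\1", phrase)  # relies on Python 3.5+ behaviour

# Fixed-width lookbehinds expect exactly one whitespace character,
# so "on  avon" is not protected here.
fixed = re.sub(r"(?<!\bon\s)(?<!\bdinas\s)(?:{0})".format("|".join(c_out)), "", phrase)

print(captured.split())
print(fixed.split())
```

The capturing version keeps `on avon`, whereas the lookbehind version removes that `avon` as well.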

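NOTE: The \1 replacement will work in Python 3.5+, where a group that did not take part in the match is replaced with an empty string. In older versions, including Python 2.x, such a backreference raises an error, so you need to replace it with a lambda: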

re.sub(rx, lambda m: m.group(1) if m.group(1) else "", phrase)
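For readers who prefer a named function over a lambda, the same idea could be written as below. This is only a sketch: `keep_protected` is an illustrative name, and the pattern is the expanded regex shown earlier. On Python 3.5+ both forms should give the same string.

```python
import re

phrase = '5 hampshire road bradford on avon avon dinas powys powys north somerset hampshire avon'
rx = (r"(\b(?:on\s+|dinas\s+)(?:avon|powys|somerset|hampshire))"
      r"|(?:avon|powys|somerset|hampshire)")

def keep_protected(match):
    # Group 1 is None when only the bare word matched, so fall back to "".
    return match.group(1) or ""

via_callback = re.sub(rx, keep_protected, phrase)
via_backref = re.sub(rx, r"\1", phrase)  # Python 3.5+ only

print(via_callback == via_backref)
print(via_callback.split())
```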
Wiktor Stribiżew answered Mar 11 '23 01:03