I am writing a function to split numbers and some other things from text in Python. The code looks something like this:
EN_EXTRACT_REGEX = '([a-zA-Z]+)'
NUM_EXTRACT_REGEX = '([0-9]+)'
AGGR_REGEX = EN_EXTRACT_REGEX + '|' + NUM_EXTRACT_REGEX
entry = re.sub(AGGR_REGEX, r' \1\2', entry)
Now, this code works perfectly fine in Python 3, but under Python 2 it fails with an "unmatched group" error.
The problem is that I need to support both versions, and I could not get it to work properly in Python 2 although I tried various other ways.
What could be the root of this problem, and is there any workaround for it?
I think the problem is that the regex pattern matches one or the other of the subpatterns EN_EXTRACT_REGEX and NUM_EXTRACT_REGEX, but never both at once. When re.sub() matches alphabetic characters with the first subpattern, it then tries to expand the \2 reference in the replacement, which fails because only the first group took part in the match and the second group is unmatched. Similarly, when the digit subpattern matches, there is no \1 group to substitute, so that also fails.
You can see that this is the case with this test in Python 2:
>>> re.sub(AGGR_REGEX, r' \1', 'abcd') # reference first pattern
' abcd'
>>> re.sub(AGGR_REGEX, r' \2', 'abcd') # reference second pattern
Traceback (most recent call last):
....
sre_constants.error: unmatched group
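The same thing can be seen by looking at the match object directly. This is only a small sketch: the helper name show_groups is made up for illustration, and the pattern is the original two-group one written out in full.

```python
import re

# The original pattern: two capture groups joined by alternation.
AGGR_REGEX = '([a-zA-Z]+)|([0-9]+)'

def show_groups(text):
    """Return the capture groups of the first match in text, or None."""
    match = re.search(AGGR_REGEX, text)
    return match.groups() if match else None

print(show_groups('abcd'))  # ('abcd', None)
print(show_groups('1234'))  # (None, '1234')
```

Whichever alternative matches, the other group is None, and that None is what the replacement template cannot expand before Python 3.5.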
The difference must lie in the different versions of the regex engine in Python 2 and Python 3. Unfortunately I cannot give a definitive reason for the difference, but there is a documented change in version 3.5 for re.sub() regarding unmatched groups:
Changed in version 3.5: Unmatched groups are replaced with an empty string.
which explains why it works in Python >= 3.5 but not in earlier versions: unmatched groups are basically ignored.
As a workaround you can change your pattern to handle both matches as a single group:
import re
EN_EXTRACT_REGEX = '[a-zA-Z]+'
NUM_EXTRACT_REGEX = '[0-9]+'
AGGR_REGEX = '(' + EN_EXTRACT_REGEX + '|' + NUM_EXTRACT_REGEX + ')'
# ([a-zA-Z]+|[0-9]+)
for s in '', '1234', 'abcd', 'a1b2c3', 'aa__bb__1122cdef', '_**_':
    print(re.sub(AGGR_REGEX, r' \1', s))
Output (the first line is blank because the empty string has no matches):

 1234
 abcd
 a 1 b 2 c 3
 aa__ bb__ 1122 cdef
_**_
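If you would rather keep the two original capture groups, another option that should behave the same under Python 2 and Python 3 is to pass a function as the replacement instead of a template string. The function only uses group(0), the whole match, so it never touches an unmatched group. The names prefix_with_space and split_tokens below are made up for this sketch.

```python
import re

EN_EXTRACT_REGEX = '([a-zA-Z]+)'
NUM_EXTRACT_REGEX = '([0-9]+)'
AGGR_REGEX = EN_EXTRACT_REGEX + '|' + NUM_EXTRACT_REGEX

def prefix_with_space(match):
    # group(0) is the entire matched text, whichever alternative matched.
    return ' ' + match.group(0)

def split_tokens(entry):
    """Insert a space before every run of letters or digits."""
    return re.sub(AGGR_REGEX, prefix_with_space, entry)

for s in '1234', 'a1b2c3', 'aa__bb__1122cdef':
    print(split_tokens(s))
```

Using r' \g<0>' as the replacement string should have the same effect, since \g<0> also refers to the whole match.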