Python regex module vs re module - pattern mismatch

Tags:

regex

Update: This issue was resolved by the developer in commit be893e9

If you encounter the same problem, update your regex module.
You need version 2017.04.23 or above.
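
To check which release is installed, something like the following may help. It is only a sketch: it assumes Python 3.8+ (for importlib.metadata) and a purely numeric, dot-separated version string such as 2017.04.23. The helper names (parse_release, has_fix, installed_regex_release) are made up for illustration.

```python
from importlib import metadata

# First release containing the fix, according to the update above.
FIXED_IN = (2017, 4, 23)


def parse_release(version):
    """Turn a date-style version string such as '2017.04.23' into a tuple of ints.

    Assumption: every dot-separated part is numeric.
    """
    return tuple(int(part) for part in version.split("."))


def has_fix(version):
    """Return True if the given release is at least FIXED_IN."""
    return parse_release(version) >= FIXED_IN


def installed_regex_release():
    """Return the installed release of the regex package, or None if it is missing."""
    try:
        return metadata.version("regex")
    except metadata.PackageNotFoundError:
        return None


release = installed_regex_release()
if release is None:
    print("regex is not installed")
else:
    print(release, "contains the fix" if has_fix(release) else "needs an update")
```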

As pointed out in this answer, I need the following regular expression to work with the regex module too:

(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})

import re     # standard library
import regex  # https://pypi.python.org/pypi/regex/

content = '"Erm....yes. T..T...Thank you for that."'
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
substitute = r"\2-\4"

print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))

Output:

"Erm....yes. T-Thank you for that."
"-yes. T..T...Thank you for that."

Q: How do I write this regex so that the regex module handles it the same way the re module does?

Using the re module is not an option, as I require variable-length look-behinds.

For clarification: it would be nice if the pattern worked with both modules, but in the end I only need it to work with regex.


asked Apr 22 '17 19:04

2 Answers

It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.

An example:

>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'

For reference, the expected output would be:

>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'

In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, the match is attempted a 2nd time. This time, (\d{1,3}) matches "5", but the x fails to match. However, the capture group's value is now (re)set to the empty string instead of the expected 123.
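
To make the expected behaviour concrete, here is a small sketch that uses only the standard re module. It inspects the capture group directly: after a failed extra iteration, the group should still hold the value from the last iteration that succeeded. The second input string is my own addition for illustration.

```python
import re

pattern = r'(?:(\d{1,3})x)+'

# One successful iteration ("123x"), then a failed attempt on "5".
m = re.search(pattern, '123x5')
print(m.group(0))  # 123x
print(m.group(1))  # 123 (kept from the successful iteration)

# Two successful iterations ("12x", "34x"), then a failed attempt on "5".
m2 = re.search(pattern, '12x34x5')
print(m2.group(1))  # 34 (the last successful iteration wins)

# The substitution therefore keeps the last successful capture.
result = re.sub(pattern, r'\1', '12x34x5')
print(result)  # 345
```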

As a workaround, we can prevent the capture group from matching in the iteration that is going to fail. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):

>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'

As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):

>>> pattern = r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'
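
In an engine without the bug, the added lookahead should be redundant, because it only asserts what the following (\w{1,3})(-|\.{2,10}) has to match anyway. As a quick sanity check (a sketch using the standard re module and only the sample string from the question), both patterns give the same result:

```python
import re

content = '"Erm....yes. T..T...Thank you for that."'
substitute = r"\2-\4"

original = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
# Same pattern with the lookahead guard in front of the nested capture group.
# The guard uses a non-capturing group, so the group numbers do not change.
guarded = r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"

original_result = re.sub(original, substitute, content)
guarded_result = re.sub(guarded, substitute, content)

print(original_result)
print(guarded_result)
print(original_result == guarded_result)
```

This only covers one input, so it is not a proof that the two patterns are equivalent everywhere.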


answered Oct 25 '22 20:10

Aran-Fey

Edit: the bug is now resolved in regex 2017.04.23.

I just tested in Python 3.6.1, and the original pattern works the same in re and regex.

Original workaround: you can use the lazy operator +?. Note that this is a different regex, which behaves differently from the original pattern in edge cases like T...Tha....Thank:

pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
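
To illustrate the edge case mentioned above, here is a sketch that runs the greedy and the lazy variant through the standard re module (which does not have the bug), so any difference comes from the patterns themselves:

```python
import re

greedy = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
lazy = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
substitute = r"\2-\4"

edge_case = "T...Tha....Thank"
greedy_result = re.sub(greedy, substitute, edge_case)
lazy_result = re.sub(lazy, substitute, edge_case)

# Greedy repeats the group as often as possible and consumes the whole stutter.
print(greedy_result)  # Tha-Thank
# Lazy stops after the first repetition, as soon as the tail can match.
print(lazy_result)    # T-Tha....Thank

# On the sample from the question, both variants agree.
content = '"Erm....yes. T..T...Thank you for that."'
print(re.sub(greedy, substitute, content) == re.sub(lazy, substitute, content))  # True
```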

The bug in 2017.04.05 was due to backtracking, something like this:

The unsuccessful longer match leaves group \2 empty. Conceptually, this failure should trigger backtracking to the shorter match, where the nested group is not empty. However, regex seems to "optimize": instead of computing the shorter match from scratch, it reuses some cached values and forgets to undo the update of the nested match groups.

For example, greedy matching with ((\w{1,3})(\.{2,10})){1,3} first attempts 3 repetitions, then backtracks to fewer:

import re
import regex

content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']

for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual:  ', tc, actual)

Output:

expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual:   1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual:   3 []
expected: 2 [('T...', 'T', '...')]
actual:   2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual:   1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
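
Picking up the TODO in the script, the comparison could be turned into an automated check roughly like this. It is a sketch with plain comparisons rather than pytest; the EXPECTED table is copied from the "expected" lines above, and the check function is a made-up helper. The regex import is optional so that the script still runs where the module is not installed.

```python
import re

try:
    import regex  # third-party module; only checked if it is installed
except ImportError:
    regex = None

content = '"Erm....yes. T..T...Thank you for that."'
template = r'((\w{1,3})(\.{2,10})){%s}'

# Expected findall() results per quantifier, taken from the output above.
EXPECTED = {
    '1,3': [('Erm....', 'Erm', '....'), ('T...', 'T', '...')],
    '3': [],
    '2': [('T...', 'T', '...')],
    '1': [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')],
}


def check(engine):
    """Return the quantifiers for which engine.findall() deviates from EXPECTED."""
    return [
        quantifier
        for quantifier, expected in EXPECTED.items()
        if engine.findall(template % quantifier, content) != expected
    ]


print('re mismatches:   ', check(re))
if regex is not None:
    print('regex mismatches:', check(regex))
```

With an affected regex release, the second line should list '1,3'; with a fixed release, both lists should be empty.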

answered Oct 25 '22 20:10

Aprillion

Tags:

python

regex

Fabian N.
