Update: This issue was resolved by the developer in commit be893e9. If you encounter the same problem, update your regex module; you need version 2017.04.23 or above.
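To check whether an installed copy is affected, something like the following might do. It is only a sketch: it assumes the package is installed from PyPI under the name regex and that its release string is date-style (like 2017.04.23); parse_release and has_fix are helper names made up for this example.

```python
import re
from importlib.metadata import PackageNotFoundError, version

FIXED_IN = (2017, 4, 23)  # first release reported as fixed in the update note


def parse_release(release):
    """Turn a date-style release string such as '2017.04.23' into a tuple of ints."""
    parts = []
    for piece in release.split(".")[:3]:
        leading = re.match(r"\d+", piece)  # ignore any non-numeric suffix
        parts.append(int(leading.group()) if leading else 0)
    return tuple(parts)


def has_fix(release):
    """True if the given release is at or after the release that fixed the bug."""
    return parse_release(release) >= FIXED_IN


try:
    installed = version("regex")  # assumption: installed under the PyPI name "regex"
except PackageNotFoundError:
    installed = None

if installed is None:
    print("regex is not installed")
elif has_fix(installed):
    print(installed, "already contains the fix")
else:
    print(installed, "is affected, please upgrade")
```

The release is read from the package metadata rather than from an attribute of the module, since the two are not guaranteed to use the same numbering.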
As pointed out in this answer, I need this regular expression:
(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})
to work with the regex module too:
import re # standard library
import regex # https://pypi.python.org/pypi/regex/
content = '"Erm....yes. T..T...Thank you for that."'
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
substitute = r"\2-\4"
print(re.sub(pattern, substitute, content))
print(regex.sub(pattern, substitute, content))
Output:
"Erm....yes. T-Thank you for that."
"-yes. T..T...Thank you for that."
Q: How do I have to write this regex so that the regex module handles it the same way the re module does?
Using the re module is not an option, as I require look-behinds with dynamic lengths.
For clarification: it would be nice if the regex worked with both modules, but in the end I only need it for regex.
It seems that this bug is related to backtracking. It occurs when a capture group is repeated, and the capture group matches but the pattern after the group doesn't.
An example:
>>> regex.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'5'
For reference, the expected output would be:
>>> re.sub(r'(?:(\d{1,3})x)+', r'\1', '123x5')
'1235'
In the first iteration, the capture group (\d{1,3}) consumes the first 3 digits, and x consumes the following "x" character. Then, because of the +, the match is attempted a second time. This time, (\d{1,3}) matches "5", but the x fails to match. However, the capture group's value is now (re)set to the empty string instead of the expected 123.
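To see what the group should hold after the failed second iteration, here is a small sketch using only the standard re module as the reference. I am deliberately not showing regex output, because that depends on which version is installed.

```python
import re

pattern = r'(?:(\d{1,3})x)+'
text = '123x5'

m = re.match(pattern, text)

# The overall match stops after the first iteration, because the second
# iteration ("5" with no following "x") fails and is rolled back.
whole = m.group(0)
print(whole)     # 123x

# The capture group keeps the value from the last *successful* iteration.
captured = m.group(1)
print(captured)  # 123

# So replacing the match with \1 leaves the trailing "5" untouched.
replaced = re.sub(pattern, r'\1', text)
print(replaced)  # 1235
```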
As a workaround, we can prevent the capture group from matching. In this case, changing it to (\d{2,3}) is enough to bypass the bug (because it no longer matches "5"):
>>> regex.sub(r'(?:(\d{2,3})x)+', r'\1', '123x5')
'1235'
As for the pattern in question, we can use a lookahead assertion; we change (\w{1,3}) to (?=\w{1,3}(?:-|\.\.))(\w{1,3}):
>>> pattern= r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
>>> regex.sub(pattern, substitute, content)
'"Erm....yes. T-Thank you for that."'
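A quick sanity check that the added lookahead does not change what the pattern means: the sketch below runs the original and the guarded pattern through the standard re module, which is the reference behaviour here, and compares the results. The names original and guarded are just labels for this example.

```python
import re

content = '"Erm....yes. T..T...Thank you for that."'
substitute = r"\2-\4"

original = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
# Same pattern, but group 2 is only attempted when a separator follows it.
guarded = r"(?i)\b((?=\w{1,3}(?:-|\.\.))(\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"

original_result = re.sub(original, substitute, content)
guarded_result = re.sub(guarded, substitute, content)

print(original_result)  # "Erm....yes. T-Thank you for that."
print(guarded_result)   # "Erm....yes. T-Thank you for that."
```

The lookahead only rules out attempts that would fail anyway, so in re the two patterns should agree; the point of the guard is that the capture group never gets a chance to match in an iteration that is about to fail.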
edit: the bug is now resolved in regex 2017.04.23. I just tested in Python 3.6.1, and the original pattern works the same in re and regex.
Original workaround: you can use the lazy quantifier +? (i.e. a different regex that behaves differently from the original pattern in edge cases like T...Tha....Thank):
pattern = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
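To make the edge case concrete, here is a sketch comparing the greedy and the lazy pattern in the standard re module, using the same \2-\4 substitution as in the question. On the question's input they agree, but on T...Tha....Thank they do not.

```python
import re

greedy = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+(\2\w{2,})"
lazy = r"(?i)\b((\w{1,3})(-|\.{2,10})[\t ]?)+?(\2\w{2,})"
substitute = r"\2-\4"

content = '"Erm....yes. T..T...Thank you for that."'
edge_case = "T...Tha....Thank"

# Same result on the input from the question.
same_on_content = re.sub(greedy, substitute, content) == re.sub(lazy, substitute, content)
print(same_on_content)  # True

# Greedy repeats as often as it can, so \2 ends up as "Tha".
greedy_result = re.sub(greedy, substitute, edge_case)
print(greedy_result)  # Tha-Thank

# Lazy stops after the first repetition that lets the rest match, so \2 is "T".
lazy_result = re.sub(lazy, substitute, edge_case)
print(lazy_result)  # T-Tha....Thank
```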
The unsuccessful longer match leaves the \2 group empty. Conceptually, this should trigger backtracking to the shorter match, where the nested group is not empty, but regex seems to "optimize": it does not compute the shorter match from scratch but reuses some cached values, forgetting to undo the update of the nested match groups.
For example, greedy matching of ((\w{1,3})(\.{2,10})){1,3} will first attempt 3 repetitions, then backtrack to fewer:
import re
import regex
content = '"Erm....yes. T..T...Thank you for that."'
base_pattern_template = r'((\w{1,3})(\.{2,10})){%s}'
test_cases = ['1,3', '3', '2', '1']
for tc in test_cases:
    pattern = base_pattern_template % tc
    expected = re.findall(pattern, content)
    actual = regex.findall(pattern, content)
    # TODO: convert to test case, e.g. in pytest
    # assert str(expected) == str(actual), '{}\nexpected: {}\nactual: {}'.format(tc, expected, actual)
    print('expected:', tc, expected)
    print('actual:  ', tc, actual)
output:
expected: 1,3 [('Erm....', 'Erm', '....'), ('T...', 'T', '...')]
actual: 1,3 [('Erm....', '', '....'), ('T...', '', '...')]
expected: 3 []
actual: 3 []
expected: 2 [('T...', 'T', '...')]
actual: 2 [('T...', 'T', '...')]
expected: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
actual: 1 [('Erm....', 'Erm', '....'), ('T..', 'T', '..'), ('T...', 'T', '...')]
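One way the comparison could be turned into a reusable check is sketched below. It treats re as the reference and reports every quantifier for which another engine disagrees. findall_mismatches is a name invented for this example, and importing regex is optional, so the script still runs where it is not installed.

```python
import re

try:
    import regex  # third-party module from PyPI; optional for this sketch
except ImportError:
    regex = None

CONTENT = '"Erm....yes. T..T...Thank you for that."'
TEMPLATE = r'((\w{1,3})(\.{2,10})){%s}'


def findall_mismatches(candidate, quantifiers=('1,3', '3', '2', '1')):
    """Return (quantifier, expected, actual) for every case where
    candidate.findall disagrees with re.findall."""
    mismatches = []
    for quantifier in quantifiers:
        pattern = TEMPLATE % quantifier
        expected = re.findall(pattern, CONTENT)
        actual = candidate.findall(pattern, CONTENT)
        if expected != actual:
            mismatches.append((quantifier, expected, actual))
    return mismatches


if regex is None:
    print('regex is not installed, nothing to compare')
else:
    for quantifier, expected, actual in findall_mismatches(regex):
        print('mismatch for {%s}:' % quantifier)
        print('  expected:', expected)
        print('  actual:  ', actual)
```

An empty result means the two engines agree on all four quantifiers; with an affected regex release, the 1,3 case from the output above is the one that would be expected to show up.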