According to the Python3 Regex documentation:
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
However, this seems to be contradicted by the following experiment:
import re
regex = re.compile('(abc|d|abcde){1,2}?(e|f)')
regex.match('abcdef')
... which matches 'abcde'. This necessarily involves 2 repetitions of (abc|d|abcde), namely, 'abc' and 'd'. However, there was an alternative match candidate that involved only 1 repetition of (abc|d|abcde), namely 'abcde'.
Am I misreading the documentation, or does {m,n}? actually minimize the number of characters matched (or some other objective), and not the number of repetitions?
{m,n}?
tries to match as few times as possible, but it's not going to reach into the abc|d|abcde
and change the behavior of |
. |
still tries the left option first.
(abc|d|abcde){1,2}?
tries to match (abc|d|abcde)
once, and succeeds, matching abc
. The regex engine then continues on with the rest of the pattern and tries to match (e|f)
, which fails. It backtracks and tries to match another repetition of abc|d|abcde
, and matches d
. It continues on to (e|f)
again and matches e
successfully.
Perhaps you expected the backtracking to try a different option for the first (abc|d|abcde)
before trying a second (abc|d|abcde)
. It doesn't do that. It would try that eventually, if it had to, but trying more matches for the {1,2}?
comes first.
It doesn't force the minimum number of repetitions, it just allows it to match fewer times.
When you have multiple alternatives that can match, regexp engines handle things differently. Some are "eager" and use the first matching alternatives, others use the "longest match" rule. Apparently Python is eager. So (abc|d|abcde)
will match abc
rather than abcde
, and then the next repetition will match e
. It doesn't backtrack to see if theres a result with fewer repetitions.
The solution is to put the longer alternatives first. Change it to
regex = re.compile('(abcde|abc|d){1,2}?(e|f)')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With