Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does {m,n}? regex actually minimize repetitions or does it minimize number of characters matched?

According to the Python3 Regex documentation:

{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

However, this seems to be contradicted by the following experiment:

import re
regex = re.compile('(abc|d|abcde){1,2}?(e|f)')
regex.match('abcdef')

... which matches 'abcde'. This necessarily involves 2 repetitions of (abc|d|abcde), namely, 'abc' and 'd'. However, there was an alternative match candidate that involved only 1 repetition of (abc|d|abcde), namely 'abcde'.

Am I misreading the documentation, or does {m,n}? actually minimize the number of characters matched (or some other objective), and not the number of repetitions?

like image 278
James Shapiro Avatar asked Apr 21 '19 07:04

James Shapiro


2 Answers

{m,n}? tries to match as few times as possible, but it's not going to reach into the abc|d|abcde and change the behavior of |. | still tries the left option first.

(abc|d|abcde){1,2}? tries to match (abc|d|abcde) once, and succeeds, matching abc. The regex engine then continues on with the rest of the pattern and tries to match (e|f), which fails. It backtracks and tries to match another repetition of abc|d|abcde, and matches d. It continues on to (e|f) again and matches e successfully.

Perhaps you expected the backtracking to try a different option for the first (abc|d|abcde) before trying a second (abc|d|abcde). It doesn't do that. It would try that eventually, if it had to, but trying more matches for the {1,2}? comes first.

like image 151
user2357112 supports Monica Avatar answered Oct 18 '22 21:10

user2357112 supports Monica


It doesn't force the minimum number of repetitions, it just allows it to match fewer times.

When you have multiple alternatives that can match, regexp engines handle things differently. Some are "eager" and use the first matching alternatives, others use the "longest match" rule. Apparently Python is eager. So (abc|d|abcde) will match abc rather than abcde, and then the next repetition will match e. It doesn't backtrack to see if theres a result with fewer repetitions.

The solution is to put the longer alternatives first. Change it to

regex = re.compile('(abcde|abc|d){1,2}?(e|f)')
like image 24
Barmar Avatar answered Oct 18 '22 20:10

Barmar