Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is the minimal (non-greedy) match affected by the end of string character '$'?

EDIT: remove original example because it provoked ancillary answers. also fixed the title.

The question is why the presence of the "$" in the regular expression effects the greedyness of the expression:

Here is a simpler example:

>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'

The "?" seems to be doing nothing. Note the when the "$" is removed, however, then the "?" is respected:

>>> m = re.search(r"a+?", str)
>>> m.group()
'a'

EDIT: In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs: "Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."

This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

like image 907
krumpelstiltskin Avatar asked May 03 '11 23:05

krumpelstiltskin


2 Answers

Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.

(Answer changed after OP clarification in comments. See history for previous text.)

like image 83
Fred Nurk Avatar answered Oct 21 '22 16:10

Fred Nurk


The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add .+? to the beginning of your pattern.

Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.

EDIT:

More details... In this case:

re.search(r"a+?$", "baaaaaaaa")

the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.

But in this case:

re.search(r"a+?", "baaaaaaaa")

the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.

like image 28
Mu Mind Avatar answered Oct 21 '22 17:10

Mu Mind