EDIT: remove original example because it provoked ancillary answers. also fixed the title.
The question is why the presence of the "$" in the regular expression effects the greedyness of the expression:
Here is a simpler example:
>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'
The "?" seems to be doing nothing. Note the when the "$" is removed, however, then the "?" is respected:
>>> m = re.search(r"a+?", str)
>>> m.group()
'a'
EDIT: In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs: "Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."
This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?
Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.
(Answer changed after OP clarification in comments. See history for previous text.)
The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add .+?
to the beginning of your pattern.
Without the $
, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.
EDIT:
More details... In this case:
re.search(r"a+?$", "baaaaaaaa")
the regex engine will ignore everything up until the first 'a', because that's how re.search
works. It will match the first a
, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $
. So it just keeps eating the a
's one at a time and checking for $
. If it were greedy, it wouldn't check for the $
after each a
, but only after it couldn't match any more a
's.
But in this case:
re.search(r"a+?", "baaaaaaaa")
the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $
in this case.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With