Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Kleene's Star: why does $_ = "a"; s/a*/e/g produce: ee

Tags:

regex

perl

a* means zero or more instances of: a right?

so why does $_ = "a"; s/a*/e/g produce: ee

Possible answer: it's replacing the string: "a" with: "e" and it's replacing the empty string: "" with: "e" as well. Or it's replacing the mere absence of a letter: a with a letter: e or it's replacing "zero occurrences" of: a with an: e

Ok then, but:

$_ = "b"; s/a*/e/g produces: ebe

It seems to be replacing the empty string to the left of: b and also the empty string to the right of: b

OK. But then why doesn't it do that for: "a" ? Why doesn't it replace the empty string to the left of: a and also the empty string to the right of: a and also the letter: a itself to get: eee ?

There are just as many zero occurrences of: a on the left side as the right side!

like image 452
Literat Avatar asked Aug 07 '12 23:08

Literat


2 Answers

Your analysis of why the results are "ee" and "ebe" is completely accurate.

The "/g" modifier causes the regex to match once, and then try to match again from where the last match stopped.

The reason for the discrepancy (it doesn't replace the empty string to the left of "a") is that is because "*" is greedy - it matches the MOST possible characters. From perldoc perlre :

By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match.

So it matches zero "a"s, and sees if it can match more. Since there are more "a"s in the string, it will match one more. Try to match more. None? Done. So we match the first "a".

Then, "/g" causes us to try to match again (starting from where we stopped after last match completed), which now matches empty (zero "a"s) string.

like image 60
DVK Avatar answered Sep 22 '22 02:09

DVK


Using Damian Conway's excellent Regexp::Debugger, I tried this:

perl -MRegexp::Debugger -E '$_ = "a"; s/a*/e/g; say' 

And got this output, in case it makes things any clearer, shown in event logging mode. The first pass match running through the replacement yields this set of events:

a               | a*              |   Starting regex match a               | a*              |     Trying a literal character zero-or-more times (as many as possible)                 | a*              |     Matched                 |                 |   Regex matched in 3 steps 

This is showing that the "a" is matched the first time, which gets replaced by "e".

After completing the match the first time, the debugger lets me run a second match from the same program:

                | <~~             |   Back-tracking in regex                 | a*              |   Back-tracked and restarting regex match                 | a*              |     Trying a literal character zero-or-more times (as many as possible)                 | a*              |     Matched                 |                 |   Regex matched in 3 steps 

This is showing that the "" after the original "a" (now "e") is matched the second time and replaced with "e".

Unfortunately, either I don't know how to read the output or Regexp::Debugger gets confused at this point or something, but it repeats again, but doesn't do a replacement.

                | <~~             |   Back-tracking in regex                 | a*              |   Back-tracked and restarting regex match                 | a*              |     Trying a literal character zero-or-more times (as many as possible)                 | a*              |     Matched                 |                 |   Regex matched in 3 steps 

Anyway, either Perl has matched a third time and decides for some reason not to do a replacement this time or Regexp::Debugger or I am just confused.

Edit: I solved my confusion by reviewing perldoc perlre:

"The higher-level loops preserve an additional state between iterations: whether the last match was zero-length. To break the loop, the following match after a zero-length match is prohibited to have a length of zero. This prohibition interacts with backtracking (see "Backtracking"), and so the second best match is chosen if the best match is of zero length."

like image 36
zostay Avatar answered Sep 25 '22 02:09

zostay