Why does (.*)* make two matches and select nothing in group $1?

This arose from a discussion on formalizing regular expression syntax. I've seen this behavior with several regular expression parsers, hence I tagged it language-agnostic.

Take the following expression (adjust it for your favorite language):

replace("input", "(.*)*", "$1")

It returns an empty string. Why?

Even more curiously, the expression replace("input", "(.*)*", "A$1B") returns the string ABAB. Why the double empty match?

Disclaimer: I know about backtracking and greedy matches, but the rules laid out by Jeffrey Friedl seem to dictate that .* matches everything and that no further backtracking or matching is done. Then why is $1 empty?

Note: compare with (.+)*, which returns the input string. However, http://regexhero.com shows that there are still two matches, which seems odd for the same reasons as above.

Abel asked Jan 24 '13 11:01


1 Answer

Let's see what happens:

  1. (.*) matches "input".
  2. "input" is captured into group 1.
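  3. The regex engine is now positioned at the end of the string. But since (.*) is repeated by the outer *, another iteration is attempted:
  4. (.*) matches the empty string after "input".
  5. The empty string is captured into group 1, overwriting "input".
  6. $1 now contains the empty string.
  7. The replace operation then tries again at the end of the string, where (.*)* matches the empty string once more. That second match is why "A$1B" yields "ABAB".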
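
The steps above can be traced in a concrete engine. Here is a minimal sketch in Python; the choice of language is my assumption, since the question is language-agnostic. It assumes Python 3.7 or later, where an empty match directly after a non-empty one is also replaced. Other engines and older Python versions may behave differently.

```python
import re

text = "input"
pattern = re.compile(r"(.*)*")

# Every match found while scanning the string, as (span, contents of group 1).
matches = [(m.span(), m.group(1)) for m in pattern.finditer(text)]
print(matches)

# Each match is replaced by "A" + group 1 + "B".
result = pattern.sub(r"A\1B", text)
print(result)
```

If the engine behaves as described, there are two matches, one spanning the whole string and one empty match at the end, and group 1 is the empty string in both, so the result is ABAB.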

A good question from the comments:

Then why does replace("input", "(input)*", "A$1B") return "AinputBAB"?

  1. (input)* matches "input". It is replaced by "AinputB".
  2. (input)* matches the empty string. It is replaced by "AB" ($1 is empty because it didn't participate in the match).
  3. Result: "AinputBAB"
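
The same kind of check works for (input)*. Again a sketch in Python 3.7 or later (my choice of language, not something the question specifies). Note that Python reports a group that did not participate as None, and since version 3.5 substitutes the empty string for it in the replacement.

```python
import re

text = "input"
pattern = re.compile(r"(input)*")

# (span, group 1) for each match; None means the group did not participate.
groups = [(m.span(), m.group(1)) for m in pattern.finditer(text)]
print(groups)

# The first match keeps "input" in group 1; the second match is empty.
result = pattern.sub(r"A\1B", text)
print(result)
```

Here group 1 keeps "input" in the first match because the second iteration fails instead of matching the empty string, which is the difference from (.*)*.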
Tim Pietzcker answered Nov 13 '22 11:11