Why does (.*)* make two matches and select nothing in group $1?

Question

This arose from a discussion on formalizing regular expression syntax. I've seen this behavior in several regular expression engines, hence the language-agnostic tag.

Take the following expression (adjust it for your favorite language):

replace("input", "(.*)*", "$1")

This returns an empty string. Why?

Even more curiously, the expression replace("input", "(.*)*", "A$1B") returns the string ABAB. Why the double empty match?

Disclaimer: I know about backtracking and greedy matches, but the rules laid out by Jeffrey Friedl seem to dictate that .* matches everything and that no further backtracking or matching is done. Then why is $1 empty?

Note: compare with (.+)*, which returns the input string. However, http://regexhero.com shows that there are still two matches, which seems odd for the same reasons as above.
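For anyone who wants to reproduce these results, here is a minimal sketch in Python. It assumes Python 3.7 or later, since earlier versions skip an empty match that is adjacent to a previous match, and other engines may treat empty iterations differently. Python writes the backreference as \1 rather than $1. The helper name replace only mirrors the pseudo-call above and is not a library function.

```python
import re

def replace(text, pattern, replacement):
    # Hypothetical helper mirroring the language-neutral replace(...) call;
    # it simply forwards to re.sub, which replaces every match.
    return re.sub(pattern, replacement, text)

empty_result = replace("input", r"(.*)*", r"\1")
doubled_result = replace("input", r"(.*)*", r"A\1B")
plus_result = replace("input", r"(.+)*", r"\1")

print(repr(empty_result))    # ''
print(repr(doubled_result))  # 'ABAB'
print(repr(plus_result))     # 'input'
```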

Tim Pietzcker · Accepted Answer

Let's see what happens:

(.*) matches "input".
"input" is captured into group 1.
The regex engine is now positioned at the end of the string. But since (.*) is itself repeated by the outer *, another iteration of the group is attempted:
(.*) matches the empty string after "input".
The empty string is captured into group 1, overwriting "input".
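$1 now contains the empty string, even though the overall match still spans "input".
A second, empty match is then found at the end of the string, which is why the replacement text appears twice.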
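The steps above can be checked by comparing the span of the overall match with the span of group 1. This is a sketch using Python's re module as one example engine; the variable names are illustrative, and other engines may reject the empty second iteration and report a different capture.

```python
import re

match = re.match(r"(.*)*", "input")

whole_span = match.span(0)   # span of the overall match
group_span = match.span(1)   # span of the last iteration of group 1
group_text = match.group(1)  # text captured by that last iteration

print(whole_span)        # (0, 5)
print(group_span)        # (5, 5)
print(repr(group_text))  # ''
```

The overall match covers all five characters, while group 1 reports only its final, empty iteration at offset 5.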

A good question from the comments:

Then why does replace("input", "(input)*", "A$1B") return "AinputBAB"?

(input)* matches "input". It is replaced by "AinputB".
(input)* then matches the empty string at the end of the input. It is replaced by "AB" ($1 is empty because the group didn't participate in that match).
Result: "AinputBAB"
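The difference between a group that matched the empty string and a group that did not participate at all can be made visible by listing every match. This is a sketch in Python (3.7 or later assumed for the re.sub result); describe is a hypothetical helper name, and Python reports a non-participating group as None.

```python
import re

def describe(pattern, text):
    # Return (overall match text, group 1) for every match, left to right.
    return [(m.group(0), m.group(1)) for m in re.finditer(pattern, text)]

literal_matches = describe(r"(input)*", "input")
star_matches = describe(r"(.*)*", "input")
replaced = re.sub(r"(input)*", r"A\1B", "input")

print(literal_matches)  # [('input', 'input'), ('', None)]
print(star_matches)     # [('input', ''), ('', '')]
print(replaced)         # AinputBAB
```

In the (input)* case the group keeps its capture in the first match and is unset in the second, whereas with (.*)* the group participates in both matches and ends on an empty capture each time.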


Abel
