This arose from a discussion on formalizing regular expression syntax. I've seen this behavior with several regular expression parsers, hence I tagged it language-agnostic.
Take the following expression (adjust it for your favorite language):
replace("input", "(.*)*", "$1")
It will return an empty string. Why?
Even more curiously, the expression
replace("input", "(.*)*", "A$1B")
will return the string "ABAB". Why the double empty match?
Disclaimer: I know about backtracking and greedy matches, but the rules laid out by Jeffrey Friedl seem to dictate that .* matches everything and that no further backtracking or matching is done. Then why is $1 empty?
Note: compare with (.+)*, which returns the input string. However, http://regexhero.com shows that there are still two matches, which seems odd for the same reasons as above.
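For a concrete reproduction, here is a minimal sketch using Python's re module. This is an assumption on my part, since the question is language-agnostic: the helper name replace is only a stand-in for the pseudo-call above, Python writes the backreference as \1 rather than $1, and Python 3.7 or later is assumed. Engines differ in how they treat empty matches, so other languages may not give identical results.

```python
import re

def replace(text, pattern, replacement):
    """Hypothetical stand-in for the language-agnostic replace() call above."""
    return re.sub(pattern, replacement, text)

# The three cases from the question, with $1 written as \1 for Python.
empty_result = replace("input", r"(.*)*", r"\1")
double_result = replace("input", r"(.*)*", r"A\1B")
plus_result = replace("input", r"(.+)*", r"\1")

print(repr(empty_result))
print(repr(double_result))
print(repr(plus_result))
```

On CPython 3.7+ this should print '', 'ABAB' and 'input', matching the behavior described above.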
Let's see what happens:
1. (.*) matches "input".
2. "input" is captured into group 1.
3. (.*) is repeated, so another match attempt is made.
4. (.*) matches the empty string after "input".
5. The empty string is captured into group 1, overwriting "input".
6. $1 now contains the empty string.
A good question from the comments:
Then why does replace("input", "(input)*", "A$1B") return "AinputBAB"?
1. (input)* matches "input". It is replaced by "AinputB".
2. (input)* matches the empty string at the end of the input. It is replaced by "AB" ($1 is empty because it didn't participate in the match).
Result: "AinputBAB"
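Both cases can be observed match by match. The following is a minimal sketch that assumes Python's re module (3.7 or later); describe_matches is a hypothetical helper introduced only for illustration. It lists each successive match together with what group 1 holds at that point.

```python
import re

def describe_matches(pattern, text):
    """Return (span, whole match, group 1) for each successive match."""
    return [(m.span(), m.group(0), m.group(1))
            for m in re.finditer(pattern, text)]

star_matches = describe_matches(r"(.*)*", "input")
literal_matches = describe_matches(r"(input)*", "input")

for label, rows in (("(.*)*", star_matches), ("(input)*", literal_matches)):
    print(label)
    for span, whole, group1 in rows:
        print("  span", span, "match", repr(whole), "group 1", repr(group1))
```

In both patterns the first match should cover "input" (span (0, 5)) and the second should be the empty string at the end (span (5, 5)). With (.*)* group 1 is the empty string in both matches, because the last repetition overwrote it. With (input)* group 1 is "input" in the first match and None in the second, which is how Python reports a group that did not participate; the replacement step treats it as empty.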