Many regex engines match .*
twice in a single-line string, e.g., when performing regex-based string replacement:
In many engines there is a 2nd match, namely the empty string; that is, even though the 1st match has consumed the entire input string, .*
is matched again, which then matches the empty string at the end of the input string.
^.*
My questions are:
Is there a good reason for this behavior? Once the input string has been consumed in full, I wouldn't expect another attempt to find a match.
Other than trial and error, can you glean from the documentation / regex dialect/standard supported which engines exhibit this behavior?
Update: revo's helpful answer explains the how of the current behavior; as for the potential why, see this related question.
Languages/platforms that DO exhibit the behavior:
# .NET, via PowerShell (behavior also applies to the -replace operator)
PS> [regex]::Replace('a', '.*', '[$&]'
[a][] # !! Note the *2* matches, first the whole string, then the empty string
# Node.js
$ node -pe "'a'.replace(/.*/g, '[$&]')"
[a][]
# Ruby
$ ruby -e "puts 'a'.gsub(/.*/, '[\\0]')"
[a][]
# Python 3.7+ only
$ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
[a][]
# Perl 5
$ echo a | perl -ple 's/.*/[$&]/g'
[a][]
# Perl 6
$ echo 'a' | perl6 -pe 's:g/.*/[$/]/'
[a][]
# Others?
Languages/platforms that do NOT exhibit the behavior:
# Python 2.x and Python 3.x <= 3.6
$ python -c "import re; print(re.sub('.*', '[\g<0>]', 'a'))"
[a] # !! Only 1 match found.
# Others?
bobble bubble brings up some good related points:
If you make it lazy like
.*?
, you'd even get 3 matches in some and 2 matches in others. Same with.??
. As soon as we use a start anchor, I thought we should get only one match, but interestingly it seems^.*?
gives two matches in PCRE fora
, whereas^.*
should result in one match everywhere.
Here's a PowerShell snippet for testing the behavior across languages, with multiple regexes:
Note: Assumes that Python 3.x is available as python3
and Perl 6 as perl6
.
You can paste the whole snippet directly on the command line and recall it from the history to modify the inputs.
& {
param($inputStr, $regexes)
# Define the commands as script blocks.
# IMPORTANT: Make sure that $inputStr and $regex are referenced *inside "..."*
# Always use "..." as the outer quoting, to work around PS quirks.
$cmds = { [regex]::Replace("$inputStr", "$regex", '[$&]') },
{ node -pe "'$inputStr'.replace(/$regex/g, '[$&]')" },
{ ruby -e "puts '$inputStr'.gsub(/$regex/, '[\\0]')" },
{ python -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
{ python3 -c "import re; print(re.sub('$regex', '[\g<0>]', '$inputStr'))" },
{ "$inputStr" | perl -ple "s/$regex/[$&]/g" },
{ "$inputStr" | perl6 -pe "s:g/$regex/[$/]/" }
$regexes | foreach {
$regex = $_
Write-Verbose -vb "----------- '$regex'"
$cmds | foreach {
$cmd = $_.ToString().Trim()
Write-Verbose -vb ('{0,-10}: {1}' -f (($cmd -split '\|')[-1].Trim() -split '[ :]')[0],
$cmd -replace '\$inputStr\b', $inputStr -replace '\$regex\b', $regex)
& $_ $regex
}
}
} -inputStr 'a' -regexes '.*', '^.*', '.*$', '^.*$', '.*?'
Sample output for regex ^.*
, which confirms bobble bubble's expectation that using the start anchor (^
) yields only one match in all languages:
VERBOSE: ----------- '^.*'
VERBOSE: [regex] : [regex]::Replace("a", "^.*", '[$&]')
[a]
VERBOSE: node : node -pe "'a'.replace(/^.*/g, '[$&]')"
[a]
VERBOSE: ruby : ruby -e "puts 'a'.gsub(/^.*/, '[\\0]')"
[a]
VERBOSE: python : python -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: python3 : python3 -c "import re; print(re.sub('^.*', '[\g<0>]', 'a'))"
[a]
VERBOSE: perl : "a" | perl -ple "s/^.*/[$&]/g"
[a]
VERBOSE: perl6 : "a" | perl6 -pe "s:g/^.*/[$/]/"
[a]
match() function of re in Python will search the regular expression pattern and return the first occurrence. The Python RegEx Match method checks for a match only at the beginning of the string. So, if a match is found in the first line, it returns the match object.
The match method returns a corresponding match object instance if zero or more characters at the beginning of the string match the regular expression pattern. In simple words, the re. match returns a match object only if the pattern is located at the beginning of the string; otherwise, it will return None.
Kinda interesting question. Instead of referring to your questions first, I'll go for your comment.
Once the input string has been consumed in full, why would you treat the fact that there is nothing left as the empty string?
A position called end of subject string is left. It's a position and can be matched. Like other zero-width assertions and anchors \b
, \B
, ^
, $
... that assert, a dot-star .*
can match an empty string. This is highly dependent on regex engine. E.g. TRegEx does it differently.
And if you do, shouldn't this result in an infinite loop?
No, this is of the main jobs of regex engines to handle. They raise a flag and store current cursor data to avoid such loops to occur. Perl docs explain it this way:
A common abuse of this power stems from the ability to make infinite loops using regular expressions, with something as innocuous as:
'foo' =~ m{ ( o? )* }x;
The
o?
matches at the beginning offoo
, and since the position in the string is not moved by the match,o?
would match again and again because of the*
quantifier. Another common way to create a similar cycle is with the looping modifier/g
...Thus Perl allows such constructs, by forcefully breaking the infinite loop. The rules for this are different for lower-level loops given by the greedy quantifiers
*+{}
, and for higher-level ones like the/g
modifier orsplit()
operator.The lower-level loops are interrupted (that is, the loop is broken) when Perl detects that a repeated expression matched a zero-length substring.
Now back to your questions:
Is there a good reason for this behavior?
Yes, there is. Every regex engine has to meet a significant amount of challenges in order to process a text. One of which is dealing with zero-length matches. Your question raises another question,
Q: How does an engine should proceed after matching a zero-length string?
A: It all depends.
It matches it then raises a flag to not match the same position again with the (same)? pattern. In PCRE .*
matches entire subject string then stops right after it. Being at the end, current position is a meaningful position in PCRE, positions can be matched or being asserted so there is a position (zero-length string) left to be matched. PCRE goes through the regex again (if g
modifier is enabled) and finds a match at the end of subject.
PCRE then tries to advance to the next immediate position to run whole process again but it fails since there is no position left.
You see if you want to prevent the second match from being happened you need to tell engine in some way:
^.*
Or to provide a better insight into what is going on:
(?!$).*
See live demo here specially take a look at debugger window.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With