Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in the behavior of $ and . that are incidental to the questions at hand.
Most regex engines:
* accept a regex that explicitly tries to match an expression after the end of the input string[1]:
    $ python -c "import re; print(re.findall('$.*', 'a'))"
    [''] # !! Matched the hypothetical empty string after the end of 'a'
* when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again[2], as explained in this answer to a related question:
    $ python -c "import re; print(re.findall('.*$', 'a'))"
    ['a', ''] # !! Matched both the full input AND the hypothetical empty string
Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the engine, by default or by configuration, reports zero-length matches).
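To make the two behaviors above easier to see, here is a minimal sketch that reports where each match starts and ends rather than only its text. The helper name match_spans is made up for this illustration; it only wraps the standard re.finditer call.

```python
import re

def match_spans(pattern, text):
    """Return (start, end, matched_text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

# '$.*' can only match at the end of the input, where '.*' has nothing to consume.
print(match_spans(r'$.*', 'a'))

# '.*$' first consumes 'a', then matches once more, empty, at the end position.
print(match_spans(r'.*$', 'a'))
```

Both empty matches sit at position 1, i.e. at the end of the one-character input rather than at some position beyond it.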
These behaviors are, at least at first glance, counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because, with patterns such as .* and .*$, the behavior is downright surprising.[3]
Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.
Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.
By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.
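The extra attempt can be pictured with a simplified global-match loop. This is only a sketch under stated assumptions, not how any particular engine is implemented: find_all_spans and its try_at_end switch are hypothetical names, and the rule for stepping past an empty match is deliberately naive, so it will not reproduce every engine's behavior for every pattern.

```python
import re

def find_all_spans(pattern, text, try_at_end=True):
    """Collect (start, end) spans of non-overlapping matches, left to right.

    try_at_end is a hypothetical switch for this sketch: when True, one more
    match attempt is made at position len(text), after all input is consumed.
    """
    regex = re.compile(pattern)
    spans = []
    pos = 0
    while pos < len(text) or (try_at_end and pos == len(text)):
        m = regex.search(text, pos)
        if m is None:
            break
        spans.append(m.span())
        # After an empty match, step forward by one to guarantee progress.
        pos = m.end() + 1 if m.start() == m.end() else m.end()
    return spans

print(find_all_spans(r'.*$', 'a'))                    # attempt at the end position included
print(find_all_spans(r'.*$', 'a', try_at_end=False))  # stop once the input is consumed
```

With try_at_end=True the loop finds the second, empty match that the question describes; with try_at_end=False it reports only the match covering 'a'.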
[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can mark the end of the input optionally followed by a trailing newline. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.
[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context: python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced. Since Python 3.7, the behavior matches that of most other regex engines: two replacements are performed, yielding [a][].
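As a small check of footnote [2], the sketch below runs the same pattern through both findall and sub. It assumes Python 3.7 or later; bracket_all is a hypothetical helper name, and the replacement template is written as a raw string so that \g<0> is not treated as a string escape.

```python
import re

PATTERN = r'.*$'

def bracket_all(text):
    """Wrap every match of PATTERN in square brackets."""
    return re.sub(PATTERN, r'[\g<0>]', text)

# On Python 3.7+, sub() replaces the same two matches that findall() reports.
print(re.findall(PATTERN, 'a'))
print(bracket_all('a'))
```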
[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* to prevent multiple matches from being found via start-of-input anchoring. Option (a) may not be available, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences. Consider the following attempt to enclose all array elements in "...": 'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b"""; option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.
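The same two workarounds can be sketched in Python, which (unlike PowerShell's -replace) lets the caller cap the number of replacements. The function names are hypothetical, and the double-replacement result assumes Python 3.7 or later.

```python
import re

def quote_all(text):
    """Unanchored, unlimited: also replaces the empty match at the end."""
    return re.sub(r'.*', r'"\g<0>"', text)

def quote_once(text):
    """Workaround (a): allow at most one replacement."""
    return re.sub(r'.*', r'"\g<0>"', text, count=1)

def quote_anchored(text):
    """Workaround (b): anchor at the start so no second match is possible."""
    return re.sub(r'^.*', r'"\g<0>"', text)

for item in ['a', 'b']:
    print(quote_all(item), quote_once(item), quote_anchored(item))
```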
A regex pattern matches a target string. The pattern is composed of a sequence of atoms. An atom is a single point within the regex pattern that the engine tries to match against the target string. The simplest atom is a literal; to treat a group of pattern parts as a single atom, enclose them in ( ) metacharacters.
The expression \w matches any word character. Word characters include alphanumeric characters (a-z, A-Z and 0-9) and the underscore (_). \W matches any non-word character.
Using character sets: for example, the regular expression [A-Za-z] matches any single uppercase or lowercase letter. In a character set, a hyphen indicates a range of characters; for example, [A-Z] matches any one capital letter.
The regular expression [A-Z][a-z]* matches any sequence of letters that starts with an uppercase letter and is followed by zero or more lowercase letters.
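A short illustration of the constructs just described, using Python's re module. The sample strings are made up; note that for str patterns Python's \w also matches non-ASCII letters and digits unless the re.ASCII flag is given, which is why the flag is passed here.

```python
import re

def words(text):
    """Runs of word characters: letters, digits, and underscore."""
    return re.findall(r'\w+', text, flags=re.ASCII)

def capitalized_runs(text):
    """An uppercase letter followed by zero or more lowercase letters."""
    return re.findall(r'[A-Z][a-z]*', text)

print(words('a_b c-d'))
print(capitalized_runs('Hello World fooBar'))
```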
I am giving this answer just to demonstrate why a regex engine would want to allow any code to appear after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:
* It starts with three digits.
* The digits are followed by one or more letters, digits, hyphens, or underscores.
* It does not end with a hyphen or an underscore.
We could write the following pattern:
^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$
But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:
^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)
or
^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$
Here, we eliminated one of the character classes, and instead used a negative lookbehind after the $ anchor to assert that the final character was not underscore or hyphen.
Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
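As a rough check, the three patterns from this answer can be compared in Python, whose re module accepts a lookbehind placed after $ (its lookbehinds must be fixed-width, which _|- satisfies). The constant names, the verdicts helper, and the sample strings are made up for this sketch. The samples avoid trailing newlines, since Python's $ also matches just before one.

```python
import re

TWO_CLASSES = re.compile(r'^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$')
LOOKBEHIND_AFTER = re.compile(r'^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)')
LOOKBEHIND_BEFORE = re.compile(r'^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$')

def verdicts(text):
    """Return one boolean per pattern: does it match the given text?"""
    patterns = (TWO_CLASSES, LOOKBEHIND_AFTER, LOOKBEHIND_BEFORE)
    return tuple(p.search(text) is not None for p in patterns)

# On these samples all three patterns give the same verdict.
for sample in ['123abc', '123a-b', '123abc_', '123-', '12a']:
    print(sample, verdicts(sample))
```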