
Why do regex engines allow / automatically attempt matching at the end of the input string?


^{Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in behavior of $ and . that are incidental to the questions at hand.}

Most regex engines:

- accept a regex that explicitly tries to match an expression after the end of the input string^[1]:

```
$ python -c "import re; print(re.findall('$.*', 'a'))"
['']  # !! Matched the hypothetical empty string after the end of 'a'
```

- when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again^[2], as explained in this answer to a related question:

```
$ python -c "import re; print(re.findall('.*$', 'a'))"
['a', '']  # !! Matched both the full input AND the hypothetical empty string
```

Perhaps needless to say, such match attempts succeed only if the regex in question matches the empty string (and the regex engine, by default or by configuration, reports zero-length matches).
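The two behaviors are easier to see when the match positions are shown alongside the matched text. Here is a small sketch using Python's `re.finditer`; the helper name `spans` is made up for illustration:

```python
import re

def spans(pattern, text):
    """Return (start, end, matched_text) for every non-overlapping match."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

# '$' can only match at index 1 (the end of 'a'); '.*' then matches nothing.
print(spans(r'$.*', 'a'))

# The first match consumes 'a'; a second, empty match is reported at index 1.
print(spans(r'.*$', 'a'))

# A pattern that cannot match the empty string yields a single match.
print(spans(r'.+$', 'a'))
```

The empty matches all sit at index 1, i.e., at the position just past the last character of the input.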

These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:

- It's not obvious what the benefit of this behavior is.
- Conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.^[3]
  - To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex - i.e., global matching - decide to even attempt another match if it knows that the entire input has been consumed already, irrespective of what the regex is (although you'll never see the symptom with a regex that doesn't at least also match the empty string)?
  - The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)^[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)

Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.

Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.

By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.
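To make the extra attempt concrete, here is a hand-written global-matching loop. It is a simplified sketch under my own assumptions, not the actual implementation of any engine, and `find_all_spans` is a hypothetical helper. The point of interest is the loop condition, which still allows a match attempt when the start position equals the length of the input:

```python
import re

def find_all_spans(pattern, text):
    """Simplified emulation of global matching (hypothetical helper)."""
    regex = re.compile(pattern)
    result = []
    pos = 0
    # Note '<=': position len(text) is still tried, because an empty
    # match at the very end of the input is a possible candidate.
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        result.append(m.span())
        # After an empty match, step forward by one to avoid looping forever.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return result

print(find_all_spans(r'.*$', 'a'))
print(find_all_spans(r'.*', ''))
```

In this sketch, changing the condition to `pos < len(text)` would suppress the trailing empty match for `.*$`, but it would also drop the single empty match that `.*` finds in an empty input string.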

^{[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can mark the end of the input optionally followed by a trailing newline. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.}

^{[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context: python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].}
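The version difference described in footnote [2] can be checked with a short script. This is only a sketch; it uses `re.subn` so that the number of replacements is reported along with the result:

```python
import re
import sys

# re.subn returns a tuple: (new_string, number_of_replacements).
result, count = re.subn(r'.*$', r'[\g<0>]', 'a')

print(sys.version_info[:2], result, count)
```

On Python 3.7 or later this should report `[a][]` with two replacements; on earlier versions, per the footnote, only one replacement is made.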

^{[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* so that start-of-input anchoring prevents multiple matches from being found.
Option (a) may not be available, depending on how a given language surfaces the functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences. Consider the following attempt to enclose all array elements in "...":
'a', 'b' -replace '.*', '"$&"'. Because each element is matched twice, this yields elements "a""" and "b""";
option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.}
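For comparison, here is how the two workarounds from footnote [3] might look in Python. This is a sketch that assumes Python 3.7 or later (where `re.sub` performs the second, empty replacement); the variable names are made up for illustration:

```python
import re

items = ['a', 'b']

# Unanchored: each element is matched twice
# (the full text, then the empty string at the end).
doubled = [re.sub(r'.*', r'"\g<0>"', s) for s in items]

# (a) Limit the number of replacements to one.
limited = [re.sub(r'.*', r'"\g<0>"', s, count=1) for s in items]

# (b) Anchor at the start of the input, so no second match can begin at the end.
anchored = [re.sub(r'^.*', r'"\g<0>"', s) for s in items]

print(doubled)
print(limited)
print(anchored)
```

Unlike PowerShell's -replace, `re.sub` accepts a `count` argument, so option (a) is available here.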


asked Sep 17 '18 14:09

mklement0


1 Answer

I am giving this answer just to demonstrate why a regex engine might want to allow pattern elements to appear after the final $ anchor. Suppose we needed to create a regex to match a string with the following rules:

- starts with three digits
- followed by one or more letters, digits, hyphens, or underscores
- ends with a letter or digit

We could write the following pattern:

^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$

But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:

^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)

or

^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$

Here, we eliminated one of the character classes and instead used a negative lookbehind, placed either immediately after or immediately before the $ anchor, to assert that the final character is not an underscore or hyphen.

Other than for a lookbehind, I see no reason why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
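As a quick check, the three patterns can be run against a few sample inputs in Python. This is only a sketch; the variable names and sample strings are my own, and the samples deliberately avoid trailing newlines, where $ behaves differently:

```python
import re

two_classes = re.compile(r'^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$')
lookbehind_after = re.compile(r'^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)')
lookbehind_before = re.compile(r'^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$')

patterns = (two_classes, lookbehind_after, lookbehind_before)
samples = ['123abc', '123a-b_c', '123a-', '123_', '123', '12ab']

# For each sample, record whether each of the three patterns matches.
outcomes = {s: [bool(p.search(s)) for p in patterns] for s in samples}

for s, results in outcomes.items():
    print(f'{s!r}: {results}')
```

For these samples, all three patterns agree: only the first two inputs match.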

answered Nov 11 '22 00:11

Tim Biegeleisen
