
Why do regex engines allow / automatically attempt matching at the end of the input string?

Note:
* Python is used to illustrate behaviors, but this question is language-agnostic.
* For the purpose of this discussion, assume single-line input only, because the presence of newlines (multi-line input) introduces variations in behavior of $ and . that are incidental to the questions at hand.

Most regex engines:

  • accept a regex that explicitly tries to match an expression after the end of the input string[1].

    $ python -c "import re; print(re.findall('$.*', 'a'))"
    ['']   # !! Matched the hypothetical empty string after the end of 'a'

  • when finding / replacing globally, i.e., when looking for all non-overlapping matches of a given regex, and having reached the end of the string, unexpectedly try to match again[2], as explained in this answer to a related question:

    $ python -c "import re; print(re.findall('.*$', 'a'))"
    ['a', '']   # !! Matched both the full input AND the hypothetical empty string

Perhaps needless to say, such match attempts succeed only if the regex in question can match the empty string (and the engine, by default or by configuration, reports zero-length matches).
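To make the two cases above easier to inspect, here is a minimal sketch that prints the offsets of every match Python's `re.finditer` reports. The helper name `spans` is made up for illustration; the third call uses a pattern that cannot match the empty string, to show that the symptom then disappears.

```python
import re

def spans(pattern, text):
    """Return (start, end, matched_text) for every match re.finditer reports."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]

# Explicitly matching after the end of the input: one zero-length match at offset 1.
print(spans(r'$.*', 'a'))

# Global matching: the full input, then a zero-length match at the end offset.
print(spans(r'.*$', 'a'))

# A pattern that cannot match the empty string yields a single match.
print(spans(r'a$', 'a'))
```

The second call shows that the extra match starts and ends at offset 1, that is, at the position just past the last character.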

These behaviors are at least at first glance counter-intuitive, and I wonder if someone can provide a design rationale for them, not least because:

  • it's not obvious what the benefit of this behavior is.
  • conversely, in the context of finding / replacing globally with patterns such as .* and .*$, the behavior is downright surprising.[3]
    • To ask the question more pointedly: Why does functionality designed to find multiple, non-overlapping matches of a regex (i.e., global matching) even attempt another match when it knows that the entire input has already been consumed, irrespective of what the regex is? (Admittedly, you'll never see the symptom with a regex that doesn't at least also match the empty string.)
    • The following languages/engines exhibit the surprising behavior: .NET, Python (both 2.x and 3.x)[2], Perl (both 5.x and 6.x), Ruby, Node.js (JavaScript)

Note that regex engines vary in behavior with respect to where to continue matching after a zero-length (empty-string) match.

Either choice (start at the same character position vs. start at the next) is defensible - see the chapter on zero-length matches at www.regular-expressions.info.

By contrast, the .*$ case discussed here is different in that, with any non-empty input, the first match for .*$ is not a zero-length match, so the behavior difference does not apply - instead, the character position should advance unconditionally after the first match, which of course is impossible if you're already at the end.
Again, my surprise is at the fact that another match is attempted nonetheless, even though there's by definition nothing left.
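One way to picture the extra attempt is a deliberately simplified model of a global-match loop. This is a sketch under stated assumptions, not how any particular engine is implemented: `find_all` and its `try_at_end` switch are hypothetical names, and the loop simply tries an anchored match at each offset, including the offset equal to the input length when `try_at_end` is true.

```python
import re

def find_all(pattern, text, try_at_end=True):
    """Simplified model of global matching (hypothetical, for illustration only).

    Tries a match at each offset in turn. With try_at_end=True the offset
    len(text), i.e. the position after the last character, is tried as well.
    """
    regex = re.compile(pattern)
    last_offset = len(text) if try_at_end else len(text) - 1
    matches = []
    pos = 0
    while pos <= last_offset:
        m = regex.match(text, pos)      # attempt a match starting exactly at pos
        if m is None:
            pos += 1                    # nothing here: try the next offset
            continue
        matches.append(m.group())
        # Resume after the match; step one character after a zero-length match.
        pos = m.end() if m.end() > pos else pos + 1
    return matches

print(find_all(r'.*$', 'a'))                    # includes the trailing empty match
print(find_all(r'.*$', 'a', try_at_end=False))  # only the full input
print(find_all(r'$', ''))                       # empty input: one empty match
print(find_all(r'$', '', try_at_end=False))     # empty input: no match at all
```

In this model, skipping the end offset removes the surprising second match for `.*$`, but it also means that `$` no longer finds anything in an empty input. Whether real engines keep the end-of-input attempt for that reason is not something this sketch can establish.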


[1] I'm using $ as the end-of-input marker here, even though in some engines, such as .NET's, it can also match just before a trailing newline at the end of the input. However, the behavior equally applies when you use the unconditional end-of-input marker, \z.

[2] Python 2.x and 3.x up to 3.6.x seemingly special-cased replacement behavior in this context: python -c "import re; print(re.sub('.*$', '[\g<0>]', 'a'))" used to yield just [a] - that is, only one match was found and replaced.
Since Python 3.7, the behavior is now like in most other regex engines, where two replacements are performed, yielding [a][].

[3] You can avoid the problem by either (a) choosing a replacement method that is designed to find at most one match or (b) using ^.* so that start-of-input anchoring prevents multiple matches from being found.
(a) may not be an option, depending on how a given language surfaces functionality; for instance, PowerShell's -replace operator invariably replaces all occurrences; consider the following attempt to enclose all array elements in "...":
'a', 'b' -replace '.*', '"$&"'. Due to matching twice, this yields elements "a""" and "b""";
option (b), 'a', 'b' -replace '^.*', '"$&"', fixes the problem.
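The same two workarounds can be sketched in Python, assuming version 3.7 or later (where, per footnote [2], `re.sub` performs both replacements). The variable names are illustrative only.

```python
import re

text = 'a'
replacement = r'[\g<0>]'

# Global replacement: the zero-length match at the end of the input is
# replaced as well (Python 3.7+ behavior, per footnote [2]).
global_result = re.sub(r'.*$', replacement, text)

# Workaround (a): ask for at most one replacement.
single_result = re.sub(r'.*$', replacement, text, count=1)

# Workaround (b): anchor at the start of the input so that no second match
# can begin at the end position.
anchored_result = re.sub(r'^.*', replacement, text)

print(global_result)    # [a][] on Python 3.7+
print(single_result)    # [a]
print(anchored_result)  # [a]
```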

asked Sep 17 '18 at 14:09 by mklement0


1 Answer

I am giving this answer just to demonstrate why a regex would want to allow any code appearing after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:

  • starts with three digits
  • followed by one or more letters, digits, hyphens, or underscores
  • ends with a letter or a digit

We could write the following pattern:

^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$ 

But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:

^\d{3}[A-Za-z0-9\-_]+$(?<!_|-) 

or

^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$ 

Here, we eliminated one of the character classes and instead used a negative lookbehind, placed either directly after or directly before the $ anchor, to assert that the final character is not an underscore or a hyphen.

Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
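As a quick check, the three patterns above can be compared in Python, whose `re` module accepts the lookbehind both before and after the `$` anchor. The sample strings and the helper name `verdicts` are made up for illustration; this is a sketch, not an exhaustive equivalence proof.

```python
import re

PATTERNS = {
    'two character classes': r'^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$',
    'lookbehind after $':    r'^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)',
    'lookbehind before $':   r'^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$',
}

# Illustrative inputs: the first three satisfy the rules, the rest do not.
SAMPLES = ['123abc', '123a-b', '1234', '123ab_', '123-', '123', '12ab']

def verdicts(sample):
    """Return, for each pattern in order, whether it matches the sample."""
    return [re.search(p, sample) is not None for p in PATTERNS.values()]

for sample in SAMPLES:
    print(f'{sample!r:10}', verdicts(sample))
```

For each of these samples, all three patterns give the same verdict.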

answered Nov 11 '22 at 00:11 by Tim Biegeleisen