Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Surprising, but correct behavior of a greedy subexpression in a positive lookbehind assertion

Note:

  • The observed behavior is correct, but may at first be surprising; it was to me, and I think it may be to others as well - though probably not to those intimately familiar with regex engines.

  • The repeatedly suggested duplicate, Regex lookahead, lookbehind and atomic groups, contains general information about look-around assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.


Using a greedy, by definition variable-width subexpression inside a positive look-behind assertion can exhibit surprising behavior.

The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general:

This command works as I intuitively expect:

# OK:  
#     The subexpression matches greedily from the start up to and
#     including the last "_", and, by including the matched string ($&) 
#     in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_', '$&|'
a_b_|c

The following command, which uses a positive look-behind assertion, (?<=...), is seemingly equivalent - but isn't:

# CORRECT, but SURPRISING:
#   Use a positive lookbehind assertion to *seemingly* match
#   only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)', '|'
a_|b_|c  # !! *multiple* insertions were performed

Why isn't it equivalent? Why were multiple insertions performed?

like image 254
mklement0 Avatar asked Mar 02 '21 19:03

mklement0


1 Answers

tl;dr:

  • Inside a look-behind assertion, a greedy subexpression in effect behaves non-greedily (in global matching in addition to acting greedily), due to considering every prefix string of the input string.

My problem was that I hadn't considered that, in a look-behind assertion, each and every character position in the input string must be checked for the preceding text up to that point to match the subexpression in the lookbehind assertion.

This, combined with the always-global replacement that PowerShell's -replace operator performs (that is, all possible matches are performed), resulted in multiple insertions:

That is, the greedy, anchored subexpression ^.+_ legitimately matched twice, when considering the text to the left of the character position currently being considered:

  • First, when a_ was the text to the left.
  • And again when a_b_ was the text to the left.

Therefore, two insertions of | resulted.


By contrast, without a look-behind assertion, greedy expression ^.+_ by definition only matches once, through to the last _, because it is only applied to the entire input string.

like image 97
mklement0 Avatar answered Oct 18 '22 15:10

mklement0