Note:
The observed behavior is correct, but may at first be surprising; it was to me, and I think it may be to others as well - though probably not to those intimately familiar with regex engines.
The repeatedly suggested duplicate, Regex lookahead, lookbehind and atomic groups, contains general information about look-around assertions, but does not address the specific misconception at hand, as discussed in more detail in the comments below.
Using a greedy, by definition variable-width subexpression inside a positive look-behind assertion can exhibit surprising behavior.
The examples use PowerShell for convenience, but the behavior applies to the .NET regex engine in general:
This command works as I intuitively expect:
# OK:
# The subexpression matches greedily from the start up to and
# including the last "_", and, by including the matched string ($&)
# in the replacement string, effectively inserts "|" there - and only there.
PS> 'a_b_c' -replace '^.+_', '$&|'
a_b_|c
The following command, which uses a positive look-behind assertion, (?<=...)
, is seemingly equivalent - but isn't:
# CORRECT, but SURPRISING:
# Use a positive lookbehind assertion to *seemingly* match
# only up to and including the last "_", and insert a "|" there.
PS> 'a_b_c' -replace '(?<=^.+_)', '|'
a_|b_|c # !! *multiple* insertions were performed
Why isn't it equivalent? Why were multiple insertions performed?
tl;dr:
My problem was that I hadn't considered that, in a look-behind assertion, each and every character position in the input string must be checked for the preceding text up to that point to match the subexpression in the lookbehind assertion.
This, combined with the always-global replacement that PowerShell's -replace
operator performs (that is, all possible matches are performed), resulted in multiple insertions:
That is, the greedy, anchored subexpression ^.+_
legitimately matched twice, when considering the text to the left of the character position currently being considered:
a_
was the text to the left.a_b_
was the text to the left.Therefore, two insertions of |
resulted.
By contrast, without a look-behind assertion, greedy expression ^.+_
by definition only matches once, through to the last _
, because it is only applied to the entire input string.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With