Consider the following commands:
text <- "abcdEEEEfg"
sub("c.+?E", "###", text)
# [1] "ab###EEEfg" <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg" <<< WEIRD
sub("c(.+?)E", "###", text, perl=T)
# [1] "ab###EEEfg" <<< OKAY
The first does exactly what I expect, basically matching just the first E. The second one should essentially be identical to the first, since all I'm doing is adding a capturing group (though I'm not using it), yet for some reason it captures an extra E. That said, it isn't fully greedy (i.e. if it was it would have captured all the Es). Even weirder, it actually still matches the pattern, even though the sub
result suggests the .+?
piece left out EE
, which can no longer be matched by the rest of the regular expression. This suggests there is an offset issue when computing the length of the matched sub-expression, rather than in the actual matching.
The final one is exactly the same but run with PCRE, and that works as expected.
Am I missing something or is this behavior undocumented/buggy?
R uses libtre
, version 0.8. For more stability, you should always use perl = TRUE
.
Note that
sub("c(.+?)E?", "###", text)
works.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With