Odd Behavior with Greedy Modifiers Inside Capture Groups

Tags: regex, r, posix-ere

Consider the following commands:

text <- "abcdEEEEfg"

sub("c.+?E", "###", text)
# [1] "ab###EEEfg"                          <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg"                           <<< WEIRD
sub("c(.+?)E", "###", text, perl=TRUE)
# [1] "ab###EEEfg"                          <<< OKAY  

The first call does exactly what I expect: the match stops at the first E. The second should be equivalent to the first, since all I'm doing is adding a capturing group (which I'm not even using), yet for some reason it consumes an extra E. That said, it isn't fully greedy either; if it were, it would have consumed all the Es. Even weirder, the pattern still matches overall, even though the sub result suggests that the .+? piece left out EE, which the rest of the regular expression can then no longer match. This suggests an offset error in computing the length of the matched sub-expression, rather than in the matching itself.

The final call is identical to the second except that it runs with the PCRE engine, and it works as expected.
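One way to check the offset hypothesis directly is to look at the match position and length instead of the substituted string. The sketch below is only illustrative: `match_span` is a hypothetical helper name, built on base R's `regexpr` and `regmatches`. The `sub` output above implies the default engine replaced four characters on this setup, whereas PCRE reports a three-character match; results from the default engine may differ between R versions, so no output is asserted for it here.

```r
text <- "abcdEEEEfg"

# Hypothetical helper: report where a pattern matches and how long the match is.
match_span <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  list(
    start   = as.integer(m),             # 1-based start of the first match
    length  = attr(m, "match.length"),   # number of characters matched
    matched = regmatches(x, m)           # the matched substring itself
  )
}

tre  <- match_span("c(.+?)E", text)               # default (TRE) engine
pcre <- match_span("c(.+?)E", text, perl = TRUE)  # PCRE engine

str(tre)
str(pcre)
```

Comparing `tre$length` with `pcre$length` shows whether the two engines disagree about the extent of the match or only about what `sub` does with it.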

Am I missing something or is this behavior undocumented/buggy?

BrodieG asked Feb 26 '14 23:02


1 Answer

By default, R's regex functions use the TRE library (libtre, version 0.8). For more stable behavior, you should always use perl = TRUE.

Note that

sub("c(.+?)E?", "###", text)

works.
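As a rough sketch of two ways to sidestep the issue, assuming the goal is simply to stop at the first E: either switch to PCRE, or drop the lazy quantifier and use a negated character class, which does not rely on lazy matching at all. The variable names are only illustrative, and the second form is equivalent only because the terminator here is a single character.

```r
text <- "abcdEEEEfg"

# Option 1: PCRE handles the lazy quantifier inside the group as expected.
res_pcre <- sub("c(.+?)E", "###", text, perl = TRUE)

# Option 2: no lazy quantifier; [^E]+ cannot run past the first E,
# so the default engine has nothing ambiguous to resolve.
res_class <- sub("c([^E]+)E", "###", text)

print(res_pcre)   # "ab###EEEfg"
print(res_class)  # "ab###EEEfg"
```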

Christopher Louden answered Nov 04 '22 00:11