Odd Behavior with Greedy Modifiers Inside Capture Groups

Question

Consider the following commands:

text <- "abcdEEEEfg"

sub("c.+?E", "###", text)
# [1] "ab###EEEfg"                          <<< OKAY
sub("c(.+?)E", "###", text)
# [1] "ab###EEfg"                           <<< WEIRD
sub("c(.+?)E", "###", text, perl = TRUE)
# [1] "ab###EEEfg"                          <<< OKAY

The first call does exactly what I expect: the match stops at the first E. The second should be equivalent to the first, since all I'm doing is adding a capturing group (which I'm not even using), yet for some reason it consumes an extra E. That said, it isn't fully greedy either; if it were, it would have consumed all the Es. Even weirder, it still matches the pattern, even though the sub result suggests the .+? piece left out EE, which can no longer be matched by the rest of the regular expression. This suggests an offset issue when computing the length of the matched sub-expression, rather than in the actual matching.

The final one is exactly the same but run with PCRE, and that works as expected.
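One way to probe the "offset" theory is to look at the match position and length that regexpr() reports, instead of inferring them from the sub() output. The snippet below is only a sketch: match_info is a hypothetical helper name, and I have deliberately left out expected output for the default-engine calls, since that is exactly the behavior in question.

```r
text <- "abcdEEEEfg"

# Hypothetical helper: report where a pattern matches and how long the match is.
match_info <- function(pattern, x, ...) {
  m <- regexpr(pattern, x, ...)
  c(start = as.integer(m), length = attr(m, "match.length"))
}

match_info("c.+?E",   text)               # default engine, no group
match_info("c(.+?)E", text)               # default engine, with group
match_info("c(.+?)E", text, perl = TRUE)  # PCRE: start 3, length 3 ("cdE")
```

If the second call reports a length of 4 rather than 3, the extra E is coming from the reported match extent and not from the replacement step in sub().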

Am I missing something or is this behavior undocumented/buggy?

Christopher Louden · Accepted Answer

By default, R's regex functions use libtre (the TRE library), version 0.8. For more predictable behavior, you should always use perl = TRUE.

Note that

sub("c(.+?)E?", "###", text)

works.
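To illustrate the perl = TRUE advice, here is a minimal sketch using the text from the question. The variable name bracketed and the negated-character-class variant are my own additions for illustration, not something from the original post.

```r
text <- "abcdEEEEfg"

# Lazy quantifier inside a capture group, evaluated by PCRE.
sub("c(.+?)E", "###", text, perl = TRUE)
# [1] "ab###EEEfg"

# The group is usable in the replacement via a backreference.
bracketed <- sub("c(.+?)E", "[\\1]", text, perl = TRUE)
bracketed
# [1] "ab[d]EEEfg"

# Alternative that avoids the lazy quantifier entirely: a negated
# character class stops at the first E without needing ".+?".
sub("c([^E]+)E", "###", text, perl = TRUE)
# [1] "ab###EEEfg"
```

The last pattern uses only plain ERE syntax, so it should not depend on how an engine handles lazy quantifiers, though I have only reasoned through the PCRE case here.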

Tags: regex, r, posix-ere

BrodieG
