I am trying to replace every occurrence of abc at the start of a word in a text I'm working with in R. The output text is highlighted in HTML over a couple of passes, so I need the replacement to ignore text inside HTML tags (between angle brackets).
The following seems to work in Python but I'm not getting any hits on my regex in R. All help appreciated.
test <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'
gsub('\\babc\\(?![^<]*>\\)', 'xxx', test)
Expected output:
xxxdef xxx<span abc>defabc xxxdef</span> xxx defabc
Instead, it ignores all instances of abc.
You need to remove the unnecessary escapes and use perl=TRUE:
test <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'
gsub('\\babc(?![^<]*>)', 'xxx', test, perl=TRUE)
## => [1] "xxxdef xxx<span abc>defabc xxxdef</span> xxx defabc"
See the online R demo
When you escape (, it matches a literal ( symbol, so, in your pattern, \\(?![^<]*>\\) matches a ( 1 or 0 times, then !, then 0+ chars other than <, then > and a literal ). In my regex, (?![^<]*>) is a negative lookahead that fails the match if an abc is followed by any 0+ chars other than < and then a >.
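To make the difference concrete, here is a small sketch of what the escaped pattern from the question actually looks for. The input strings are made up for illustration and are not part of the original text; perl = TRUE is set only so the run uses the same engine as the fix.

```r
# Escaped pattern from the question: the parentheses are literal characters,
# so this is NOT a lookahead.
escaped <- '\\babc\\(?![^<]*>\\)'

# Hypothetical inputs that contain the literal text the pattern describes.
samples <- c('abc(!foo>) abcdef',  # "abc", "(", "!", "foo", ">", ")"
             'abc!>) abcdef',      # same, but without the optional "("
             'abcdef abc')         # ordinary text: nothing to match

result <- gsub(escaped, 'xxx', samples, perl = TRUE)
print(result)
```

Only the first two strings change, because they happen to contain a literal !, > and ) after abc. Ordinary text such as the question's test string has no such sequence, which is why no replacement was made there.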
Without perl=TRUE, R's gsub uses the TRE regex flavor, which does not support lookarounds (not even lookaheads). Thus, you have to tell gsub via perl=TRUE that you want the PCRE engine to be used.
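A minimal sketch comparing the two engines on the same pattern. The tryCatch wrapper is an addition for illustration: it assumes the default engine rejects the lookahead syntax with an error, and captures the message instead of stopping the script. The exact message text may vary between R versions.

```r
test <- 'abcdef abc<span abc>defabc abcdef</span> abc defabc'
pattern <- '\\babc(?![^<]*>)'

# Default engine (TRE): the lookahead is expected to be rejected,
# so capture the error message rather than letting the call abort.
default_result <- tryCatch(
  gsub(pattern, 'xxx', test),
  error = function(e) conditionMessage(e)
)
print(default_result)

# PCRE engine: the lookahead is understood and the replacement is applied.
pcre_result <- gsub(pattern, 'xxx', test, perl = TRUE)
print(pcre_result)
```

The same perl = TRUE argument is accepted by sub, grepl, regexpr and gregexpr, so any of them can use lookarounds this way.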
See the online PCRE regex demo.