I'm trying to use multiple lookaheads to simulate an AND operator in an R Perl-type regex with grep. However, I don't understand the output I am seeing. This is a sample code block:
a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
grep("(?<=abc)(?=xyz)", a, ignore.case=TRUE, perl=TRUE) # returns 1
grep("(?=abc)(?=xyz)", a, ignore.case=TRUE, perl=TRUE) # returns integer(0)
The second line suggests that the position in the string is between abc and xyz, and matches 'abcxyz'. Why does it not match 'abcdefxyz'?
On the third line, I am trying to get 1, 4 and 5, but it returns integer(0) (no match). Why is this happening?
I am using the alternative solution below for now, but I would like to use lookaheads so that I do not have to spell out every possible ordering when there are several terms.
grep("abc.*xyz|xyz.*abc", a, ignore.case=TRUE, perl=TRUE) # returns 1 4 5 as expected
The (?<=abc)(?=xyz) regex only matches a location (a place in the string) that is between abc and xyz. It will find a match in abcxyz but won't find a match in abcdefxyz, as the xyz does not immediately follow abc.
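To see that this pattern matches a zero-width position rather than any characters, one can inspect the match with regexpr(), which reports the start and length of the first match in each string. A small sketch (the names m, pos and len are only for illustration):

```r
a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")

# regexpr() gives the 1-based start of the first match, or -1 if there is none
m <- regexpr("(?<=abc)(?=xyz)", a, perl = TRUE)

pos <- as.integer(m)            # start positions
len <- attr(m, "match.length")  # match lengths; 0 means a zero-width match

print(pos)  # 4 for "abcxyz", -1 for every other element
print(len)  # 0 for "abcxyz", -1 for every other element
```

Only the first element has a position that is both directly after abc and directly before xyz, and the match there consumes no characters.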
The (?=abc)(?=xyz) pattern will never match anything, since it requires a location in the string that is followed by a 3-letter sequence equal to both abc and xyz at the same time, which is impossible.
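Consecutive lookaheads are all tested at the same position, which is why the two conditions collide. As a rough illustration, the sketch below contrasts the conflicting pattern with a variant (chosen here only as an example) where the second lookahead first skips three characters, so both conditions can hold at once:

```r
a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")

# Both lookaheads inspect the 3 characters right at the current position,
# so they can never succeed together.
conflicting <- grepl("(?=abc)(?=xyz)", a, perl = TRUE)

# Here the second lookahead skips 3 characters before requiring "xyz",
# so the conditions are compatible: "abc" here and "xyz" right after it.
compatible <- grepl("(?=abc)(?=.{3}xyz)", a, perl = TRUE)

print(conflicting)  # FALSE for every element
print(compatible)   # TRUE only for "abcxyz"
```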
What you are looking for is
^(?=.*abc)(?=.*xyz)
Or, to support multiline input, add the DOTALL modifier (?s) (so that . can match line breaks, too):
(?s)^(?=.*abc)(?=.*xyz)
These will match a string that contains both abc and xyz, in any order.
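The effect of (?s) only shows up when the input contains a line break. A minimal sketch, assuming a made-up two-line string x where the two terms sit on different lines:

```r
x <- "abc\nxyz"  # hypothetical input: the terms are on separate lines

# Without (?s), "." stops at the line break, so ".*xyz" cannot reach line 2
without_dotall <- grepl("^(?=.*abc)(?=.*xyz)", x, perl = TRUE)

# With (?s), "." also matches "\n", so both lookaheads can succeed
with_dotall <- grepl("(?s)^(?=.*abc)(?=.*xyz)", x, perl = TRUE)

print(without_dotall)  # FALSE
print(with_dotall)     # TRUE
```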
See R demo:
a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
grep("^(?=.*abc)(?=.*xyz)", a, ignore.case=TRUE, perl=TRUE)
## => [1] 1 4 5
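Since each lookahead is independent of the others, the same idea extends to any number of terms without listing every ordering. The helper below is a hypothetical sketch (all_terms_pattern is not a base R function); it wraps each term in \Q...\E so that it is treated literally, which assumes no term itself contains \E:

```r
# Hypothetical helper: build "(?s)^(?=.*\Qt1\E)(?=.*\Qt2\E)..." from a vector
all_terms_pattern <- function(terms) {
  # One lookahead per term; \Q...\E quotes the term as a literal string
  lookaheads <- paste0("(?=.*\\Q", terms, "\\E)", collapse = "")
  paste0("(?s)^", lookaheads)
}

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")

two_terms <- grep(all_terms_pattern(c("abc", "xyz")), a,
                  ignore.case = TRUE, perl = TRUE)
three_terms <- grep(all_terms_pattern(c("abc", "def", "xyz")), a,
                    ignore.case = TRUE, perl = TRUE)

print(two_terms)    # 1 4 5
print(three_terms)  # 4 5
```

With the alternation approach, three terms would already need six orderings; here each extra term only adds one more lookahead.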