
Understanding lookahead in R regexp




I'm trying to use multiple lookaheads to simulate an AND operator in R with Perl-compatible regex (perl=TRUE) and grep. However, I don't understand the output I am seeing. Here is a sample code block:

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
grep("(?<=abc)(?=xyz)", a, ignore.case=TRUE, perl=TRUE)  # returns 1
grep("(?=abc)(?=xyz)", a, ignore.case=TRUE, perl=TRUE)  # returns integer(0)

The second line suggests that the position in the string is between abc and xyz, and matches 'abcxyz'. Why does it not match 'abcdefxyz'?

With the third line, I am trying to get 1, 4 and 5, but it returns integer(0) (no match). Why is this happening?

I am using the alternative solution below, but I would like to use lookaheads so that I don't have to handle ordering explicitly when there are multiple terms.

grep("abc.*xyz|xyz.*abc", a, ignore.case=TRUE, perl=TRUE)  # returns 1 4 5 as expected
Naumz asked Jan 23 '17 08:01


1 Answer

The (?<=abc)(?=xyz) regex only matches a location (a position in the string, not any characters) that is immediately preceded by abc and immediately followed by xyz. It will find a match in abcxyz but not in abcdefxyz, because there xyz does not immediately follow abc.
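
To make the "location" idea concrete, here is a small sketch using regexpr, which reports where the match starts and how long it is. The vector name x is only for illustration.

```r
x <- c("abcxyz", "abcdefxyz")

# regexpr reports the start of the first match and its length
m <- regexpr("(?<=abc)(?=xyz)", x, perl = TRUE)

as.integer(m)            # 4 -1 : position 4 in "abcxyz", no match in "abcdefxyz"
attr(m, "match.length")  # 0 -1 : the match is zero-width, it consumes no characters
```

The match in "abcxyz" sits between c and x and has length 0, which is why grep reports the element as matching even though no text is captured.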

The (?=abc)(?=xyz) pattern will never match anything: both lookaheads are checked at the same location, so it requires the next three characters to be abc and xyz at the same time, which is impossible.
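
A short sketch of that point: consecutive lookaheads are all tested from the same position, so they only succeed together when the text at that position satisfies every one of them. The second and third patterns are illustrative additions, not part of the question.

```r
a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")

# Both lookaheads start at the same position, so this can never succeed
grepl("(?=abc)(?=xyz)", a, perl = TRUE)
# FALSE FALSE FALSE FALSE FALSE

# Compatible lookaheads at the same position do succeed
grepl("(?=abc)(?=abcd)", "abcdef", perl = TRUE)
# TRUE

# Adding .* lets each lookahead search ahead independently
grepl("(?=.*abc)(?=.*xyz)", "xyzdefabc", perl = TRUE)
# TRUE
```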

What you are looking for is

^(?=.*abc)(?=.*xyz)

Or, to support multiline input, add the DOTALL modifier (?s) so that . also matches line breaks:

(?s)^(?=.*abc)(?=.*xyz)

These will match a string that has both abc and xyz in any order.
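
As a sketch of why the (?s) modifier matters, the example below uses a made-up vector b whose elements contain a line break. With perl=TRUE the dot does not match a newline by default, so .* cannot reach a term on the second line.

```r
# Hypothetical inputs with embedded newlines
b <- c("abc\nxyz", "xyz\nabc", "abc\ndef")

# Without (?s), .* stops at the line break
grepl("^(?=.*abc)(?=.*xyz)", b, perl = TRUE)
# FALSE FALSE FALSE

# With (?s), .* can cross the line break
grepl("(?s)^(?=.*abc)(?=.*xyz)", b, perl = TRUE)
# TRUE TRUE FALSE
```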

See R demo:

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
grep("^(?=.*abc)(?=.*xyz)", a, ignore.case=TRUE, perl=TRUE)
## => [1] 1 4 5
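
Since the pattern is just one lookahead per term, it extends to any number of terms. The helper below is a minimal sketch, not part of the original answer: the name has_all is hypothetical, and it assumes the terms are literal strings, which it quotes with PCRE's \Q...\E (this would break for a term that itself contains \E).

```r
# Hypothetical helper: indices of elements of x containing every term in `words`
has_all <- function(x, words) {
  # One lookahead per term; \Q...\E treats the term as literal text
  lookaheads <- paste0("(?=.*\\Q", words, "\\E)", collapse = "")
  grep(paste0("^", lookaheads), x, ignore.case = TRUE, perl = TRUE)
}

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
has_all(a, c("abc", "xyz"))
## => [1] 1 4 5
has_all(a, c("abc", "def", "xyz"))
## => [1] 4 5
```

Unlike the alternation abc.*xyz|xyz.*abc, this does not need one branch per ordering, so the pattern grows linearly with the number of terms.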
Wiktor Stribiżew answered Sep 30 '22 07:09
