
Understanding lookahead in R regexp

Tags: regex, r

I'm trying to use multiple lookaheads to simulate an AND operator in R's Perl-compatible regex (perl=TRUE) with grep. However, I don't understand the output I am seeing. Here is a sample:

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
grep("(?<=abc)(?=xyz)", a, ignore.case=TRUE, perl=TRUE)  # returns 1
grep("(?=abc)(?=xyz)", a, ignore.case=TRUE, perl=TRUE)  # returns integer(0)

The pattern on the second line matches a position in the string that sits between abc and xyz, so it matches 'abcxyz'. Why does it not match 'abcdefxyz'?

On the third line, I am trying to get 1, 4 and 5, but it returns integer(0), i.e. no matches. Why is this happening?

I am using the alternation below as a workaround, but I would like to use lookaheads so that I don't have to spell out every possible ordering when there are several terms.

grep("abc.*xyz|xyz.*abc", a, ignore.case=TRUE, perl=TRUE)  # returns 1 4 5 as expected

Naumz asked Jan 23 '17 08:01


1 Answer

The (?<=abc)(?=xyz) regex matches only a position in the string that is immediately preceded by abc and immediately followed by xyz. It finds a match in abcxyz but not in abcdefxyz, because there xyz does not immediately follow abc.
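
To see where that zero-width match lands, regexpr can report its position. This is only a small sketch on two of the strings from the question; the variable names a and pos are illustrative:

```r
a <- c("abcxyz", "abcdefxyz")

# regexpr() reports the start of the first match and its length
pos <- regexpr("(?<=abc)(?=xyz)", a, perl = TRUE)

as.integer(pos)            # 4 for "abcxyz", -1 (no match) for "abcdefxyz"
attr(pos, "match.length")  # 0 for the match: lookarounds consume no characters
```

The match is an empty string at position 4, i.e. the gap between c and x, which is why grep reports the element as matching even though no characters are consumed.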

The (?=abc)(?=xyz) pattern will never match anything: both lookaheads are checked at the same position, so it requires the next 3 characters to be abc and xyz at the same time, which is impossible.
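
A quick way to convince yourself that consecutive lookaheads are tested from the same position is to let the second one look a little further. The second pattern below is purely illustrative and not something you would use in practice:

```r
a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")

# Both lookaheads start at the same position, so the next three
# characters would have to be "abc" and "xyz" at once: never true.
same_pos <- grepl("(?=abc)(?=xyz)", a, perl = TRUE)

# From that same position the second lookahead skips three arbitrary
# characters first, so it is satisfied only by "abcxyz".
shifted <- grepl("(?=abc)(?=...xyz)", a, perl = TRUE)

same_pos  # all FALSE
shifted   # TRUE only for the first element
```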

What you are looking for is

^(?=.*abc)(?=.*xyz)

Or, to support multiline input, add the DOTALL modifier (?s) so that . can also match line breaks:

(?s)^(?=.*abc)(?=.*xyz)

Both lookaheads are anchored at the start of the string and each scans ahead independently, so these patterns match a string that has both abc and xyz in any order.
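
The effect of (?s) only shows up when the terms sit on different lines. A minimal sketch, assuming the default PCRE newline handling in R (the vector b is made up for illustration):

```r
b <- c("abc\nxyz", "xyz\nabc")

# Without (?s), "." stops at the line break, so the term on the
# second line is never reached from the start of the string
no_dotall <- grepl("^(?=.*abc)(?=.*xyz)", b, perl = TRUE)

# With (?s), "." also matches "\n", so both strings match
dotall <- grepl("(?s)^(?=.*abc)(?=.*xyz)", b, perl = TRUE)

no_dotall  # FALSE FALSE
dotall     # TRUE TRUE
```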

See R demo:

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")
grep("^(?=.*abc)(?=.*xyz)", a, ignore.case=TRUE, perl=TRUE)
## => [1] 1 4 5
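
If you have more than two terms, the same pattern can be built programmatically. The helper and_pattern below is hypothetical (it is not part of base R) and assumes the terms contain no regex metacharacters:

```r
# Hypothetical helper: one lookahead per term, all anchored at the start
and_pattern <- function(terms) {
  paste0("^", paste0("(?=.*", terms, ")", collapse = ""))
}

a <- c("abcxyz", "abcdef", "defxyz", "abcdefxyz", "xyzdefabc")

and_pattern(c("abc", "xyz"))  # "^(?=.*abc)(?=.*xyz)"

hits2 <- grep(and_pattern(c("abc", "xyz")), a, ignore.case = TRUE, perl = TRUE)
hits3 <- grep(and_pattern(c("abc", "def", "xyz")), a, ignore.case = TRUE, perl = TRUE)

hits2  # 1 4 5
hits3  # 4 5
```

Since the terms are pasted into the pattern as-is, anything like . or ( in a term would need to be escaped first.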

Wiktor Stribiżew answered Sep 30 '22 07:09