Why does strsplit use positive lookahead and lookbehind assertion matches differently?

Tags: regex, r, strsplit

Common sense and a sanity-check using gregexpr() indicate that the look-behind and look-ahead assertions below should each match at exactly one location in testString:

    testString <- "text XX text"
    BB  <- "(?<= XX )"
    FF  <- "(?= XX )"

    as.vector(gregexpr(BB, testString, perl=TRUE)[[1]])
    # [1] 9
    as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1])
    # [1] 5

strsplit(), however, uses those match locations differently, splitting testString at one location when using the lookbehind assertion, but at two locations -- the second of which seems incorrect -- when using the lookahead assertion.

    strsplit(testString, BB, perl=TRUE)
    # [[1]]
    # [1] "text XX " "text"

    strsplit(testString, FF, perl=TRUE)
    # [[1]]
    # [1] "text"    " "       "XX text"

I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit() to be better behaved?


Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).

Josh O'Brien asked Mar 22 '13 16:03


1 Answer

I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit:

The algorithm applied to each input string is

    repeat {
        if the string is empty
            break.
        if there is a match
            add the string to the left of the match to the output.
            remove the match and all to the left of it.
        else
            add the string to the output.
            break.
    }

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.

The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:

    FF <- "(?=funky)"
    testString <- "take me to funky town"

    gregexpr(FF,testString,perl=TRUE)
    # [[1]]
    # [1] 12
    # attr(,"match.length")
    # [1] 0
    # attr(,"useBytes")
    # [1] TRUE

    strsplit(testString,FF,perl=TRUE)
    # [[1]]
    # [1] "take me to " "f"           "unky town"

What happens is that the lone lookahead (?=funky) matches at position 12. So the first split consists of the string up to position 11 (everything to the left of the match), and that text is removed from the string together with the match, although the match itself has zero length.

Now the remaining string is funky town, and the lookahead matches at position 1. However, there is nothing to remove, because there is nothing to the left of the match and the match itself has zero length. Followed literally, the algorithm would be stuck in an infinite loop. Apparently R resolves this by splitting off a single character, which incidentally is the documented behaviour when calling strsplit with an empty regex (argument split=""). After this the remaining string is unky town, which is returned as the last split since there is no match.
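The walk-through above can be reproduced with a minimal sketch of the documented loop. The function name strsplit_sketch is hypothetical, and the one-character fallback for an empty match at position 1 is an assumption inferred from the observed output, not a transcription of R's actual implementation.

```r
# Hypothetical helper: the loop quoted from ?strsplit, plus an ASSUMED
# fallback that consumes one character when an empty match sits at position 1.
strsplit_sketch <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m   <- regexpr(pattern, x, perl = TRUE)
    pos <- as.integer(m)                  # 1-based match position, -1 if none
    if (pos == -1L) {                     # no match: emit the rest and stop
      out <- c(out, x)
      break
    }
    len <- attr(m, "match.length")
    if (pos == 1L && len == 0L) {         # assumed fallback: split off one character
      out <- c(out, substr(x, 1L, 1L))
      x   <- substr(x, 2L, nchar(x))
    } else {                              # emit the text left of the match,
      out <- c(out, substr(x, 1L, pos - 1L))
      x   <- substr(x, pos + len, nchar(x))  # then drop it and the match
    }
  }
  out
}

strsplit_sketch("take me to funky town", "(?=funky)")
# expected: c("take me to ", "f", "unky town")
```

Under these assumptions the sketch gives the same pieces as strsplit(..., perl=TRUE) for the lookahead and lookbehind examples in this thread.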

Lookbehinds like (?<= XX ) are no problem: the text they assert lies to the left of the match, so each split removes it from the remaining string, the pattern cannot match again at position 1, and the algorithm is never stuck.
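To see why the lookbehind case terminates, one step of the documented loop can be traced by hand. This is only an illustrative sketch; the variable names pos, len, left and rest are made up for the example.

```r
testString <- "text XX text"
BB <- "(?<= XX )"

m   <- regexpr(BB, testString, perl = TRUE)
pos <- as.integer(m)                      # 9, a zero-length match
len <- attr(m, "match.length")            # 0

# One pass of the loop: emit what is left of the match, keep the rest.
left <- substr(testString, 1L, pos - 1L)                  # "text XX "
rest <- substr(testString, pos + len, nchar(testString))  # "text"

# The lookbehind needs " XX " before the match, and `rest` no longer has it.
as.integer(regexpr(BB, rest, perl = TRUE))                # -1: no further match
```

Because the asserted text was removed with the first piece, the next search fails and "text" is returned as the final element.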

Admittedly, this behaviour looks weird at first glance. Behaving otherwise, however, would violate the assumption of zero length for lookaheads. Given that the strsplit algorithm is documented, I believe this does not meet the definition of a bug.
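For (Q2), one possible workaround follows from this explanation. It is offered as a sketch, not as something stated in the documentation: prepend a lookbehind (?<=.) to the lookahead, so that the pattern cannot match at position 1 of the remaining string, where no preceding character exists. The name FF2 is hypothetical.

```r
testString <- "text XX text"
FF2 <- "(?<=.)(?= XX )"   # hypothetical name: the lookahead guarded by a lookbehind

res <- strsplit(testString, FF2, perl = TRUE)
res[[1]]
# expected: c("text", " XX text")
```

One consequence of this design is that a match at the very start of the original string is also suppressed, so the string is not split there.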

Theodore Lytras answered Sep 25 '22 19:09