Common sense and a sanity-check using gregexpr()
indicate that the look-behind and look-ahead assertions below should each match at exactly one location in testString
:
testString <- "text XX text" BB <- "(?<= XX )" FF <- "(?= XX )" as.vector(gregexpr(BB, testString, perl=TRUE)[[1]]) # [1] 9 as.vector(gregexpr(FF, testString, perl=TRUE)[[1]][1]) # [1] 5
strsplit()
, however, uses those match locations differently, splitting testString
at one location when using the lookbehind assertion, but at two locations -- the second of which seems incorrect -- when using the lookahead assertion.
strsplit(testString, BB, perl=TRUE) # [[1]] # [1] "text XX " "text" strsplit(testString, FF, perl=TRUE) # [[1]] # [1] "text" " " "XX text"
I have two questions: (Q1) What's going on here? And (Q2) how can one get strsplit()
to be better behaved?
Update: Theodore Lytras' excellent answer explains what's going on, and so addresses (Q1). My answer builds on his to identify a remedy, addressing (Q2).
I am not sure whether this qualifies as a bug, because I believe this is expected behaviour based on the R documentation. From ?strsplit
:
The algorithm applied to each input string is
repeat { if the string is empty break. if there is a match add the string to the left of the match to the output. remove the match and all to the left of it. else add the string to the output. break. }
Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
The problem is that lookahead (and lookbehind) assertions are zero-length. So for example in this case:
FF <- "(?=funky)" testString <- "take me to funky town" gregexpr(FF,testString,perl=TRUE) # [[1]] # [1] 12 # attr(,"match.length") # [1] 0 # attr(,"useBytes") # [1] TRUE strsplit(testString,FF,perl=TRUE) # [[1]] # [1] "take me to " "f" "unky town"
What happens is that the lonely lookahead (?=funky)
matches at position 12. So the first split includes the string up to position 11 (left of the match), and it is removed from the string, together with the match, which -however- has zero length.
Now the remaining string is funky town
, and the lookahead matches at position 1. However there's nothing to remove, because there's nothing at the left of the match, and the match itself has zero length. So the algorithm is stuck in an infinite loop. Apparently R resolves this by splitting a single character, which incidentally is the documented behaviour when strsplit
ing with an empty regex (when argument split=""
). After this the remaining string is unky town
, which is returned as the last split since there's no match.
Lookbehinds are no problem, because each match is split and removed from the remaining string, so the algorithm is never stuck.
Admittedly this behaviour looks weird at first glance. Behaving otherwise however would violate the assumption of zero length for lookaheads. Given that the strsplit
algorithm is documented, I belive this does not meet the definition of a bug.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With