
Inconsistent behavior between str_split and strsplit

Tags: string, split, r

The documentation for str_split in the stringr package states that for the pattern argument:

If "" splits into individual characters.

which suggests it behaves the same as strsplit in this regard. However,

library(stringr)
str_split("abcab","")
[[1]]
[1] ""  "a" "b" "c" "a" "b"

with a leading empty string. This compares with,

strsplit("abcab","")
[[1]]
[1] "a" "b" "c" "a" "b"

Leading empty strings seem to be normal behavior when splitting on non-empty strings,

strsplit("abcab","ab")
[[1]]
[1] ""  "c"

but even then, str_split generates an 'extra' trailing empty string:

str_split("abcab","ab")
[[1]]
[1] ""  "c" "" 

Is this discrepancy a bug, a feature, an error in the documentation, or just a different notion of what's 'expected behavior'?

joran asked Sep 09 '11 20:09


1 Answer

If you use commas as delimiters, the "expected" (your mileage may vary) result is more obvious:

# expect "" "2" "3" "4" ""

strsplit(",2,3,4,", ",")
# [[1]]
# [1] ""  "2" "3" "4"

str_split(",2,3,4,", ",")
# [[1]]
# [1] ""  "2" "3" "4" "" 

If I have n commas, then I expect (n+1) elements to be returned, so I prefer the results from str_split. However, I wouldn't necessarily call this a bug in strsplit, since it performs as advertised:

(from ?strsplit) Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is ‘""’, but if there is a match at the end of the string, the output is the same as with the match removed.
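If you want the (n+1)-element result without stringr, one option in base R is to ask regmatches for the pieces between matches. This is only a sketch: split_keep_all is a hypothetical helper name, it treats the delimiter as a fixed string rather than a regular expression, and it is not meant for the "" pattern.

```r
# Hypothetical helper (not part of base R or stringr): split `x` on a fixed
# delimiter and keep both leading and trailing empty strings.
split_keep_all <- function(x, delim) {
  # Locate every occurrence of the delimiter as a literal string
  m <- gregexpr(delim, x, fixed = TRUE)
  # invert = TRUE returns the text between matches instead of the matches
  regmatches(x, m, invert = TRUE)
}

split_keep_all(",2,3,4,", ",")
# [[1]]
# [1] ""  "2" "3" "4" ""
```

With 4 commas this returns 5 elements, matching the str_split result above.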

"" is trickier, as there is no way to count the number of times "" appears in a string. Therefore treating it as a special case seems justified.
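One way to pin down what "splits into individual characters" should mean is to write it as a check: the result has exactly nchar(x) elements, each one character long, and pasting them back gives the original string. A minimal sketch in base R; splits_into_chars is a hypothetical name, and the second call uses the str_split output reported in the question rather than calling stringr:

```r
# Hypothetical check: does `pieces` look like `x` split into single characters?
splits_into_chars <- function(x, pieces) {
  length(pieces) == nchar(x) &&
    all(nchar(pieces) == 1L) &&
    identical(paste(pieces, collapse = ""), x)
}

splits_into_chars("abcab", strsplit("abcab", "")[[1]])
# [1] TRUE

# The str_split("abcab", "") output reported in the question
splits_into_chars("abcab", c("", "a", "b", "c", "a", "b"))
# [1] FALSE
```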

(from ?str_split) If ‘""’ splits into individual characters.

Based on this I suggest you have found a bug and should take hadley's advice and report it!

pete answered Nov 02 '22 13:11