strsplit inconsistent with gregexpr

A comment on my answer to this question points out that the strsplit call, which should give the desired result, does not, even though the regex appears to match only the first and last commas in the string. This can be verified using gregexpr and regmatches.

So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

Huh?! What is going on?

Simon O'Hanlon asked May 31 '14 11:05


1 Answer

@Aprillion's theory is correct. From the R documentation for strsplit:

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}

In other words, the match and everything before it are removed at each iteration, so ^ matches the beginning of the remaining string every time, not just the beginning of the original string.
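
The documented loop can be sketched directly in R to make this visible. This is only an illustration: split_like_strsplit is a hypothetical helper, not part of base R and not the actual C implementation, and it assumes a single string and a pattern that never produces a zero-length match.

```r
# Hypothetical helper that mimics the loop quoted from the documentation.
# Assumes one input string and a pattern without zero-length matches.
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  repeat {
    if (nchar(s) == 0) break
    m <- regexpr(pattern, s, perl = TRUE)
    pos <- as.integer(m)
    if (pos != -1) {
      len <- attr(m, "match.length")
      if (len == 0) stop("zero-length matches are not handled in this sketch")
      # Add the text to the left of the match to the output.
      out <- c(out, substr(s, 1, pos - 1))
      # Remove the match and everything to the left of it, so that
      # ^ anchors at the start of the remainder on the next pass.
      s <- substr(s, pos + len, nchar(s))
    } else {
      # No match left: the remainder is the last piece.
      out <- c(out, s)
      break
    }
  }
  out
}

x <- "123,34,56,78,90"
pattern <- '^\\w+\\K,|,(?=\\w+$)'
print(split_like_strsplit(x, pattern))
# [1] "123" "34"  "56"  "78"  "90"
```

After the first pass the remaining string is "34,56,78,90", where the first alternative `^\\w+\\K,` matches again, and so on for every comma.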

A simple illustration of this behavior:

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""

Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to @JoshO'Brien for the link.)
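
If the goal is to cut the string at exactly the positions that gregexpr reports, one possible workaround (a suggestion, not something taken from the documentation excerpt above) is to let gregexpr locate the matches once on the whole string and then ask regmatches for the unmatched pieces with invert = TRUE. Because the regex is applied to the original string only, ^ anchors just once.

```r
x <- "123,34,56,78,90"

# Locate the first and last comma on the full, unmodified string.
m <- gregexpr('^\\w+\\K,|,(?=\\w+$)', x, perl = TRUE)

# invert = TRUE returns the substrings between the matches
# instead of the matches themselves.
pieces <- regmatches(x, m, invert = TRUE)[[1]]
print(pieces)
# [1] "123"      "34,56,78" "90"
```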

Casimir et Hippolyte answered Oct 15 '22 05:10