strsplit inconsistent with gregexpr

Question

A comment on my answer to this question pointed out that a regex which should give the desired result with strsplit does not, even though it appears to match only the first and last commas in a character vector. This can be verified using gregexpr and regmatches.

So why does strsplit split on each comma in this example, even though regmatches only returns two matches for the same regex?

#  We would like to split on the first comma and
#  the last comma (positions 4 and 13 in this string)
x <- "123,34,56,78,90"

#  Splits on every comma. Must be wrong.
strsplit( x , '^\\w+\\K,|,(?=\\w+$)' , perl = TRUE )[[1]]
#[1] "123" "34"  "56"  "78"  "90" 


#  Ok. Let's check the positions of matches for this regex
m <- gregexpr( '^\\w+\\K,|,(?=\\w+$)' , x , perl = TRUE )

#  Matching positions are at
unlist(m)
#[1]  4 13

#  And extracting them...
regmatches( x , m )
#[[1]]
#[1] "," ","

Huh?! What is going on?

Casimir et Hippolyte · Accepted Answer

@Aprillion's theory is correct. From the R documentation for strsplit:

The algorithm applied to each input string is

repeat {
    if the string is empty
        break.
    if there is a match
        add the string to the left of the match to the output.
        remove the match and all to the left of it.
    else
        add the string to the output.
        break.
}
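The loop above can be sketched in R roughly as follows. This is only an illustrative emulation, not R's actual implementation: split_like_strsplit is a hypothetical helper name, it handles a single string, and it simply stops on zero-length matches, which the real strsplit treats specially.

```r
# Hypothetical helper: emulates the documented strsplit loop for ONE string.
# Assumption: zero-length matches are not supported in this sketch.
split_like_strsplit <- function(s, pattern) {
  out <- character(0)
  repeat {
    # if the string is empty: break
    if (nchar(s) == 0) break

    # look for the first match in what is LEFT of the string
    m     <- regexpr(pattern, s, perl = TRUE)
    start <- as.integer(m)
    len   <- attr(m, "match.length")

    if (start == -1) {
      # no match: add the remaining string to the output and stop
      out <- c(out, s)
      break
    }
    if (len == 0) stop("zero-length matches are not handled in this sketch")

    # add the text to the left of the match to the output
    out <- c(out, substr(s, 1, start - 1))
    # remove the match and everything to the left of it
    s <- substr(s, start + len, nchar(s))
  }
  out
}

split_like_strsplit("123,34,56,78,90", "^\\w+\\K,|,(?=\\w+$)")
# [1] "123" "34"  "56"  "78"  "90"
```

Because the pattern is re-applied to the shortened string on every pass, the ^ anchor gets a fresh start each time, which reproduces the strsplit output from the question.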

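In other words, at each iteration ^ matches the beginning of a new, shorter string (the preceding items have already been removed). In the example from the question, this means the first alternative of the regex matches again after every split, so every comma ends up being used as a delimiter.

A simple illustration of this behavior: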

> x <- "12345"
> strsplit( x , "^." , perl = TRUE )
[[1]]
[1] "" "" "" "" ""
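For contrast, here is a small check of what gregexpr does with the same pattern. It searches the original string as a whole, so ^ can only match at position 1:

```r
x <- "12345"

# gregexpr scans the whole, unmodified string, so "^." matches only once
m <- gregexpr("^.", x, perl = TRUE)

unlist(m)
# [1] 1

regmatches(x, m)[[1]]
# [1] "1"
```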

Here, you can see the consequence of this behavior with a lookahead assertion as delimiter (Thanks to @JoshO'Brien for the link.)
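Since gregexpr reports only the two intended matches, one possible workaround (a sketch, not part of the original answer) is to skip strsplit and let regmatches return the pieces between the matches via its invert argument. The variable name parts is just illustrative.

```r
x <- "123,34,56,78,90"

# Same regex as in the question: first comma or last comma
m <- gregexpr("^\\w+\\K,|,(?=\\w+$)", x, perl = TRUE)

# invert = TRUE returns the text between the matches instead of the matches
parts <- regmatches(x, m, invert = TRUE)[[1]]
parts
# [1] "123"      "34,56,78" "90"
```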

Tags:

regex

r

pcre

strsplit

Simon O'Hanlon
