
Split strings on first and last commas

I would like to split strings on the first and last comma. Each string has at least two commas. Below is an example data set and the desired result.

A similar question here asked how to split on the first comma: Split on first comma in string

Here I asked how to split strings on the first two colons: Split string on first two colons

Thank you for any suggestions. I prefer a solution in base R. Sorry if this is a duplicate.

my.data <- read.table(text='

my.string        some.data
123,34,56,78,90     10
87,65,43,21         20
a4,b6,c8888         30
11,bbbb,ccccc       40
uu,vv,ww,xx         50
j,k,l,m,n,o,p       60', header = TRUE, stringsAsFactors=FALSE)

desired.result <- read.table(text='

 my.string1 my.string2 my.string3 some.data
        123   34,56,78         90        10
         87      65,43         21        20
         a4         b6      c8888        30
         11       bbbb      ccccc        40
         uu      vv,ww         xx        50
          j  k,l,m,n,o          p        60', header = TRUE, stringsAsFactors=FALSE)
Mark Miller asked Dec 25 '22 08:12



2 Answers

You can use the \K operator, which keeps text already matched out of the result, together with a positive lookahead assertion to do this. It almost works: an annoying comma remains at the start of the middle portion, which I have not yet managed to get rid of within the strsplit call. But I enjoyed this as an exercise in constructing a regex...

x <- '123,34,56,78,90'
strsplit( x , "^[^,]+\\K|,(?=[^,]+$)" , perl = TRUE )
#[[1]]
#[1] "123"       ",34,56,78" "90"

Explanation:

  • ^[^,]+ : from the start of the string, match one or more characters that are not a ,
  • \\K : but don't include those matched characters in the match
  • So the first match is the empty position just before the first comma, which is why that comma stays attached to the middle portion...
  • | : or you can match...
  • ,(?=[^,]+$) : a , so long as it is followed by [(?=...)] one or more characters that are not a , until the end of the string ($), i.e. the last comma
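
One way to deal with the leftover comma is a small post-processing step after the split. This is only a sketch: the extra strings come from the question's data, and the names `parts` and `res` are just for illustration.

```r
# Example strings taken from the question's data
x <- c("123,34,56,78,90", "a4,b6,c8888", "j,k,l,m,n,o,p")

# Split as in the answer above; the middle piece keeps a leading comma
parts <- strsplit(x, "^[^,]+\\K|,(?=[^,]+$)", perl = TRUE)

# Drop a single leading comma from every piece (only the middle one has it)
parts <- lapply(parts, function(p) sub("^,", "", p))

# Bind the pieces into a three-column character matrix, one row per string
res <- do.call(rbind, parts)
colnames(res) <- paste0("my.string", 1:3)
res
```

The result is a character matrix, so it could be combined with the remaining columns of my.data via cbind if a data frame is wanted.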
Simon O'Hanlon answered Dec 27 '22 21:12



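Here is a relatively simple approach. In the first line we use sub to replace the first and last commas with semicolons, producing s; because .* is greedy, the pattern spans from the first comma to the last. Then we read s using sep=";" and finally cbind the rest of my.data to it. This assumes the strings themselves contain no semicolons: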

s <- sub(",(.*),", ";\\1;", my.data[[1]])
DF <- read.table(text=s, sep =";", col.names=paste0("mystring",1:3), as.is=TRUE)
cbind(DF, my.data[-1])

giving:

  mystring1 mystring2 mystring3 some.data
1       123  34,56,78        90        10
2        87     65,43        21        20
3        a4        b6     c8888        30
4        11      bbbb     ccccc        40
5        uu     vv,ww        xx        50
6         j k,l,m,n,o         p        60
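
If a temporary separator is undesirable (for example because the strings might contain semicolons), the three pieces can also be captured directly with regexec and regmatches from base R. This is a different technique from the one above, sketched under the question's assumption that every string has at least two commas; the pattern and the name `pieces` are illustrative.

```r
# Example strings taken from the question's data
x <- c("123,34,56,78,90", "87,65,43,21", "j,k,l,m,n,o,p")

# Three capture groups: text before the first comma, the middle part,
# and text after the last comma (the middle .* is greedy)
pattern <- "^([^,]*),(.*),([^,]*)$"
m <- regmatches(x, regexec(pattern, x))

# Each element of m holds the full match followed by the three groups;
# drop the full match and stack the groups row by row
pieces <- t(sapply(m, function(v) v[-1]))
colnames(pieces) <- paste0("mystring", 1:3)
pieces
```

A string with fewer than two commas would not match the pattern, so this sketch would need an extra check before it could be used on less regular data.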
G. Grothendieck answered Dec 27 '22 22:12
