Split string on first two colons

Tags: string, regex, r

I would like to split a column of strings on the first two colons, but not on any subsequent colons:

my.data <- read.table(text='

my.string    some.data
123:34:56:78   -100
87:65:43:21    -200
a4:b6:c8888    -300
11:bbbb:ccccc  -400
uu:vv:ww:xx    -500', header = TRUE)

desired.result <- read.table(text='

my.string1  my.string2  my.string3  some.data
123         34          56:78         -100
87          65          43:21         -200
a4          b6          c8888         -300
11          bbbb        ccccc         -400
uu          vv          ww:xx         -500', header = TRUE)

I have searched extensively and the following question is the closest to my current dilemma:

Split on first comma in string

Thank you for any suggestions. I prefer to use base R.

EDIT:

The number of characters before the first colon is not always two and the number of characters between the first two colons is not always two. So, I edited the example to reflect this.

Mark Miller asked Nov 03 '13 03:11


2 Answers

In base R:

> my.data <- read.table(text='
+ 
+ my.string    some.data
+ 123:34:56:78   -100
+ 87:65:43:21    -200
+ a4:b6:c8888    -300
+ 11:bbbb:ccccc  -400
+ uu:vv:ww:xx    -500', header = TRUE, stringsAsFactors = FALSE)
> m <- regexec("^([^:]+):([^:]+):(.*)$", my.data$my.string)
> my.data$my.string1 <- unlist(lapply(regmatches(my.data$my.string, m), '[', c(2)))
> my.data$my.string2 <- unlist(lapply(regmatches(my.data$my.string, m), '[', c(3)))
> my.data$my.string3 <- unlist(lapply(regmatches(my.data$my.string, m), '[', c(4)))
> my.data
      my.string some.data my.string1 my.string2 my.string3
1  123:34:56:78      -100        123         34      56:78
2   87:65:43:21      -200         87         65      43:21
3   a4:b6:c8888      -300         a4         b6      c8888
4 11:bbbb:ccccc      -400         11       bbbb      ccccc
5   uu:vv:ww:xx      -500         uu         vv      ww:xx

Note that I've used stringsAsFactors = FALSE so that my.string is read as a character vector rather than a factor. (Since R 4.0.0 this is the default for read.table.)
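The three regmatches() calls above repeat the same work. As a minimal sketch, assuming every string contains at least two colons (so the pattern matches every row), the matches can be computed once and bound into a matrix. The name `parts` is only illustrative and is not from the original answer.

```r
# Same example data as in the question, built directly as a data frame.
my.data <- data.frame(
  my.string = c("123:34:56:78", "87:65:43:21", "a4:b6:c8888",
                "11:bbbb:ccccc", "uu:vv:ww:xx"),
  some.data = c(-100, -200, -300, -400, -500),
  stringsAsFactors = FALSE
)

# Two colon-free groups, then everything that remains after the second colon.
m <- regexec("^([^:]+):([^:]+):(.*)$", my.data$my.string)

# regmatches() returns one character vector per row: full match + 3 groups.
# ASSUMPTION: every row matches; a non-matching row yields character(0),
# which rbind() would silently drop and misalign the rows.
parts <- do.call(rbind, regmatches(my.data$my.string, m))

# Column 1 is the full match, so the captured groups are columns 2 to 4.
my.data$my.string1 <- parts[, 2]
my.data$my.string2 <- parts[, 3]
my.data$my.string3 <- parts[, 4]
```

If some rows might not match, it would be safer to check that `nrow(parts) == nrow(my.data)` before assigning the new columns.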

Simon answered Oct 27 '22 19:10


Using package stringr:

library(stringr)
str_match(my.data$my.string, "(.+?):(.+?):(.*)")

     [,1]            [,2]  [,3]   [,4]   
[1,] "123:34:56:78"  "123" "34"   "56:78"
[2,] "87:65:43:21"   "87"  "65"   "43:21"
[3,] "a4:b6:c8888"   "a4"  "b6"   "c8888"
[4,] "11:bbbb:ccccc" "11"  "bbbb" "ccccc"
[5,] "uu:vv:ww:xx"   "uu"  "vv"   "ww:xx"

UPDATE: With the revised example (above), the solution from Hadley's comment also works:

str_split_fixed(my.data$my.string, ":", 3)
     [,1]  [,2]   [,3]   
[1,] "123" "34"   "56:78"
[2,] "87"  "65"   "43:21"
[3,] "a4"  "b6"   "c8888"
[4,] "11"  "bbbb" "ccccc"
[5,] "uu"  "vv"   "ww:xx"
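
Since the question prefers base R, here is a rough base R analogue of str_split_fixed(). The helper name split_fixed_base is hypothetical (it is not part of base R or stringr), and this is only a sketch: strsplit() drops a trailing empty piece, so a string that ends in the separator will not be reproduced exactly as stringr would.

```r
# Hypothetical helper: split each string into at most n pieces on a literal
# separator, leaving any further separators inside the last piece.
split_fixed_base <- function(x, sep = ":", n = 3) {
  pieces <- strsplit(x, sep, fixed = TRUE)
  out <- vapply(pieces, function(p) {
    if (length(p) > n) {
      # Glue everything from the n-th piece onward back together.
      p <- c(p[seq_len(n - 1)], paste(p[n:length(p)], collapse = sep))
    }
    # Pad short results with empty strings so every row has n entries.
    c(p, rep("", n - length(p)))
  }, character(n))
  # vapply() returns one column per input string; transpose to one row each.
  t(out)
}

x <- c("123:34:56:78", "87:65:43:21", "a4:b6:c8888",
       "11:bbbb:ccccc", "uu:vv:ww:xx")
res <- split_fixed_base(x, ":", 3)
```

Here `res` is a character matrix with one row per input string and three columns, which can be attached to the data frame column by column.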

topchef answered Oct 27 '22 18:10