
String split in R with complex divisions

Tags:

string

r

I have a data frame (day.df) with a column vial that I am trying to split into four new columns: treatment, gender, line and block. day.df also has the columns response and explanatory, which will be retained.

So day.df currently looks like this (top 4 of 31000 rows):

    vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.1   0        4
    Xm4.1   0        4
    .       .        .
    .       .        .        
    .       .        .

The contents of the vial column currently look like this: Xm1.2.

  • The first character (shown as X) can be X or A; this is the treatment.
  • The second character (shown in the example as m) can be m or f; this is the gender.
  • The third part (shown as 1) is a number ranging from 1 to 40, so it can be one or two characters; this is the line.
  • The final character is the block and ranges from 1 to 4.
  • The "." needs to be discarded.

As such the new day.df will look something like this (I use four "random" rows to illustrate the variation within each new column):

        vial    response explanatory  treatment gender line  block
        Xm1.1   0        4            X         m      1     1
        Am1.1   0        4            A         m      1     1
        Xf3.2   0        4            X         f      3     2
        Xm4.2   0        4            X         m      4     2
        .       .        .
        .       .        .        
        .       .        .

I've taken a look around online for how to do this and this is the closest I came; I tried to split the vial column like this...

 > a=strsplit(day.df$vial,"")
 > a[1] "Xm1.2"

but had problems when the "line" section of the string went above 9, because it then contains two characters, e.g. for the row where vial is Af20.2:

 > a[300]
 [[1]]
 [1] "A" "f" "2" "0" "." "2"

Should read as:

 > a[300]
 [[1]]
 [1] "A" "f" "20" "." "2"
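
For reference, a minimal sketch of a split that keeps runs of digits together, which is the output described above. It uses only base R (gregexpr and regmatches); the vector name vial and the two sample values are assumptions made for illustration, not rows taken from the real data.

```r
# Two sample vial codes (illustrative values only)
vial <- c("Xm1.2", "Af20.2")

# Match either a run of digits or any single non-digit character,
# so "20" stays together instead of becoming "2" "0"
tokens <- regmatches(vial, gregexpr("[0-9]+|[^0-9]", vial))

tokens[[2]]
# [1] "A"  "f"  "20" "."  "2"
```

This only addresses the tokenising step; the "." is still present and the pieces still need to be placed into columns.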



So the steps I need help solving are:

  1. Overcome the problem with the line section of the string when over 9.
  2. Add the list of the split string to the day.df dataframe in the four required columns
rg255 asked Jul 05 '13 11:07



2 Answers

Using gsub and strsplit like this:

v <- c('Xm1.1','Xf3.2')
h <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])','\\1|\\2|\\3|\\4',v)
do.call(rbind,strsplit(h,'[|]'))

    [,1] [,2] [,3] [,4]
[1,] "X"  "m"  "1"  "1" 
[2,] "X"  "f"  "3"  "2" 

The result is a character matrix; you can cbind it to your original data.frame.
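
Because do.call(rbind, strsplit(...)) returns a character matrix, line and block come through as strings. A minimal sketch of one way to turn the pieces into a data frame with named, typed columns before binding; the object names m and parts and the sample vector are assumptions made for illustration.

```r
# Sample vial codes, including a two-digit line (illustrative values only)
v <- c("Xm1.1", "Xf3.2", "Af20.2")

# Same gsub/strsplit idea as above: insert "|" between the captured groups
h <- gsub("(X|A)(m|f)([0-9]{1,2})[.]([1-4])", "\\1|\\2|\\3|\\4", v)
m <- do.call(rbind, strsplit(h, "[|]"))  # character matrix, one row per vial

# Name the columns and convert the numeric pieces to integers
parts <- data.frame(
  treatment = m[, 1],
  gender    = m[, 2],
  line      = as.integer(m[, 3]),
  block     = as.integer(m[, 4]),
  stringsAsFactors = FALSE
)
```

parts can then be passed to cbind alongside the original data frame, and no separate colnames step is needed.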

EDIT @GriffinEvo Applied & tested code:

 a <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])',
           '\\1|\\2|\\3|\\4', day.df$vial)

 # inspect the split pieces as a character matrix
 do.call(rbind, strsplit(a, '[|]'))
 # append the four pieces to day.df and name the new columns
 day.df <- cbind(day.df, do.call(rbind, strsplit(a, '[|]')))
 colnames(day.df)[4:7] <- c("treatment", "gender", "line", "block")
agstudy answered Nov 20 '22 06:11


Read the data:

Lines <- "vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.1   0        4
    Xm4.1   0        4
"

day.df <- read.table(text = Lines, header = TRUE, as.is = TRUE)

1) Then process it using strapplyc. (We used as.is = TRUE so that day.df$vial is character; if it is a factor in your data frame, replace day.df$vial with as.character(day.df$vial).) This approach does the parsing in just one short line of code:

library(gsubfn)    
s <- strapplyc(day.df$vial, "(.)(.)(\\d+)[.](.)", simplify = rbind)

# we can now cbind it to the original data frame
colnames(s) <- c("treatment", "gender", "line", "block")
cbind(day.df, s)

which gives:

  vial response explanatory treatment gender line block
1 Xm1.1        0           4         X      m    1     1
2 Xm2.1        0           4         X      m    2     1
3 Xm3.1        0           4         X      m    3     1
4 Xm4.1        0           4         X      m    4     1

2) Here is a different approach. It does not use any packages, is relatively simple (no regular expressions at all) and involves only one R statement, including the cbind'ing:

transform(day.df,
 treatment = substring(vial, 1, 1),        # 1st char
 gender = substring(vial, 2, 2),           # 2nd char
 line = substring(vial, 3, nchar(vial)-2), # 3rd through 2 prior to last char
 block = substring(vial, nchar(vial)))     # last char

The result is as before.
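
The sample data above only has single-digit lines, so here is a minimal sketch checking the same substring idea against two-digit lines. The three rows are made-up values, and the as.integer conversions and the name out are additions for illustration.

```r
# Made-up rows covering one- and two-digit lines
day.df <- data.frame(vial = c("Xm1.1", "Af20.2", "Xf40.4"),
                     response = 0, explanatory = 4,
                     stringsAsFactors = FALSE)

out <- transform(day.df,
  treatment = substring(vial, 1, 1),                       # 1st char
  gender    = substring(vial, 2, 2),                       # 2nd char
  line      = as.integer(substring(vial, 3, nchar(vial) - 2)),  # digits before the "."
  block     = as.integer(substring(vial, nchar(vial))))    # last char

out$line
# [1]  1 20 40
```

Since the line is taken as everything from the third character up to two before the end, its width does not matter as long as the block is always a single character.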

UPDATE: Added second approach.

UPDATE: Some simplifications.

G. Grothendieck answered Nov 20 '22 06:11