
String split in R with complex divisions

Tags:

string

r

I have a df (day.df) with the column vial, which I am trying to split into four new columns. The new columns will be treatment, gender, line and block. The day.df dataframe also has the columns response and explanatory, which will be retained.

So day.df currently looks like this (top 4 of 31000 rows):

    vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.1   0        4
    Xm4.1   0        4
    .       .        .
    .       .        .        
    .       .        .

The contents of the vial column look like this: Xm1.2.

The first character (shown as X) can be X or A; this is the treatment.
The second character (shown in the example as m) can be m or f; this is the gender.
The third part (shown as 1) is a number ranging from 1 to 40; this is the line.
The final character is the block and ranges from 1 to 4.
The "." needs to be discarded.

As such the new day.df will look something like this (I use four "random" rows to illustrate the variation within each new column):

        vial    response explanatory  treatment gender line  block
        Xm1.1   0        4            X         m      1     1
        Am1.1   0        4            A         m      1     1
        Xf3.2   0        4            X         f      3     2
        Xm4.2   0        4            X         m      4     2
        .       .        .
        .       .        .        
        .       .        .

I've taken a look around online for how to do this and this is the closest I came; I tried to split the vial column like this...

 > a=strsplit(day.df$vial,"")
 > a[1] "Xm1.2"

but had problems when the "line" section of the string went above 9, because it then has two characters, e.g. for the row where vial is Af20.2:

 > a[300]
 [[1]]
 [1] "A" "f" "2" "0" "." "2"

Should read as:

 > a[300]
 [[1]]
 [1] "A" "f" "20" "." "2"

So the steps I need help solving are:

1. Overcome the problem with the line section of the string when it is over 9.
2. Add the pieces of the split string to the day.df dataframe as the four required columns.


asked Jul 05 '13 11:07

rg255

2 Answers

Use gsub and strsplit like this:

v <- c('Xm1.1','Xf3.2')
h <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])','\\1|\\2|\\3|\\4',v)
do.call(rbind,strsplit(h,'[|]'))

    [,1] [,2] [,3] [,4]
[1,] "X"  "m"  "1"  "1" 
[2,] "X"  "f"  "3"  "2"

The result is a character matrix, which you can cbind to your original data.frame.
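Every piece in that matrix is still a character string, so line and block are text rather than numbers. A minimal sketch of one way to turn it into a data frame with named, typed columns, assuming the same pattern as above; the sample value 'Af20.2' and the name parts are only illustrative:

```r
# Illustrative sample values, including a two-digit line ('Af20.2')
v <- c("Xm1.1", "Xf3.2", "Af20.2")

# Same idea as above: put "|" between the captured pieces, then split on it
h <- gsub("(X|A)(m|f)([0-9]{1,2})[.]([1-4])", "\\1|\\2|\\3|\\4", v)
m <- do.call(rbind, strsplit(h, "[|]"))

# Build a data frame with named columns; line and block become integers
parts <- data.frame(
  treatment = m[, 1],
  gender    = m[, 2],
  line      = as.integer(m[, 3]),
  block     = as.integer(m[, 4]),
  stringsAsFactors = FALSE
)
parts
```

Because `[0-9]{1,2}` captures one or two digits as a single group, the two-digit line stays together as 20.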

EDIT @GriffinEvo Applied & tested code:

 # Put "|" between the four captured pieces of each vial string
 a = gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])',
           '\\1|\\2|\\3|\\4',day.df$vial) 

 # Split on "|" and bind the resulting four-column matrix onto day.df
 day.df = cbind(day.df,do.call(rbind,strsplit(a,'[|]'))) 
 colnames(day.df)[4:7] = c ("treatment" , "gender" , "line" , "block")
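One thing to watch with this approach: gsub returns a string unchanged when the pattern does not match, so a malformed vial value would not split into four pieces. A small sketch of a check you could run first, assuming the vial format described in the question; the toy data frame, the value "bad" and the names pattern, ok and bad.vials are made up for illustration:

```r
# Toy data with one deliberately malformed vial value
day.df <- data.frame(
  vial        = c("Xm1.1", "Af20.2", "bad"),
  response    = c(0, 0, 0),
  explanatory = c(4, 4, 4),
  stringsAsFactors = FALSE
)

# Anchored version of the pattern so the whole string has to match
pattern <- "^(X|A)(m|f)([0-9]{1,2})[.]([1-4])$"

ok <- grepl(pattern, day.df$vial)   # TRUE where the vial is well formed
bad.vials <- day.df$vial[!ok]       # values that need a closer look
bad.vials
```

If bad.vials is empty, every row should split cleanly into four pieces.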


answered Nov 20 '22 06:11

agstudy

Read the data:

Lines <- "vial    response explanatory
    Xm1.1   0        4
    Xm2.1   0        4
    Xm3.1   0        4
    Xm4.1   0        4
"

day.df <- read.table(text = Lines, header = TRUE, as.is = TRUE)

1) Then process it using strapplyc. (We used as.is=TRUE so that day.df$vial is character, but if it's a factor in your data frame then replace day.df$vial with as.character(day.df$vial).) This approach does the parsing in just one short line of code:

library(gsubfn)    
s <- strapplyc(day.df$vial, "(.)(.)(\\d+)[.](.)", simplify = rbind)

# we can now cbind it to the original data frame
colnames(s) <- c("treatment", "gender", "line", "block")
cbind(day.df, s)

which gives:

  vial response explanatory treatment gender line block
1 Xm1.1        0           4         X      m    1     1
2 Xm2.1        0           4         X      m    2     1
3 Xm3.1        0           4         X      m    3     1
4 Xm4.1        0           4         X      m    4     1
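If you would rather not depend on gsubfn, something similar can be sketched in base R with regexec and regmatches. This is an alternative I am swapping in, not what strapplyc does internally; the sample vector below (including 'Af20.2') is only for illustration:

```r
# Illustrative vial values, including a two-digit line
vial <- c("Xm1.1", "Xm2.1", "Af20.2")

# regexec + regmatches give, per string, the full match followed by the groups
m <- regmatches(vial, regexec("^(.)(.)([0-9]+)[.](.)$", vial))

# Drop the full match (first element) and stack the groups into a matrix
s <- do.call(rbind, lapply(m, function(x) x[-1]))
colnames(s) <- c("treatment", "gender", "line", "block")
s
```

As with the strapplyc version, s is a character matrix that can be cbind'ed to the data frame.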

2) Here is a different approach. This does not use any packages and is relatively simple (no regular expressions at all) and only involves one R statement including the cbind'ing:

transform(day.df,
 treatment = substring(vial, 1, 1),        # 1st char
 gender = substring(vial, 2, 2),           # 2nd char
 line = substring(vial, 3, nchar(vial)-2), # 3rd through 2 prior to last char
 block = substring(vial, nchar(vial)))     # last char

The result is as before.
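As a quick check that the substring approach copes with the two-digit case from the question, here is a sketch on a tiny made-up data frame containing Af20.2. The as.integer calls are my addition, on the assumption that you want line and block as numbers:

```r
# Made-up rows: one single-digit line and one two-digit line
day.df <- data.frame(
  vial        = c("Xm1.1", "Af20.2"),
  response    = c(0, 0),
  explanatory = c(4, 4),
  stringsAsFactors = FALSE
)

res <- transform(day.df,
  treatment = substring(vial, 1, 1),                        # 1st char
  gender    = substring(vial, 2, 2),                        # 2nd char
  line      = as.integer(substring(vial, 3, nchar(vial) - 2)), # digits before the "."
  block     = as.integer(substring(vial, nchar(vial))))     # last char
res
```

Since the end position of line is computed from nchar(vial), it picks up "20" for Af20.2 and "1" for Xm1.1.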

UPDATE: Added second approach.

UPDATE: Some simplifications.

answered Nov 20 '22 06:11

G. Grothendieck
