I have a data frame (day.df) with the column vial, which I am trying to split into four new columns. The new columns will be treatment, gender, line and block. The day.df data frame also has the columns response and explanatory, which will be retained.
So day.df currently looks like this (top 4 of 31000 rows):
vial response explanatory
Xm1.1 0 4
Xm2.1 0 4
Xm3.1 0 4
Xm4.1 0 4
. . .
. . .
. . .
The current contents of the vial column look like this: Xm1.2.
The first character (X) can be X or A; this will be the treatment.
The second character (m) can be m or f; this is the gender.
The number that follows (1) ranges from 1 to 40; this is the line.
The number after the dot is the block and ranges from 1 to 4.
As such, the new day.df will look something like this (I use four "random" rows to illustrate the variation within each new column):
vial response explanatory treatment gender line block
Xm1.1 0 4 X m 1 1
Am1.1 0 4 A m 1 1
Xf3.2 0 4 X f 3 2
Xm4.2 0 4 X m 4 2
. . .
. . .
. . .
I've taken a look around online for how to do this, and this is the closest I came. I tried to split the vial column like this:
> a=strsplit(day.df$vial,"")
> a[1] "Xm1.2"
but I had problems when the "line" section of the string went above 9, because then it is two characters long, e.g. for the row where vial is Af20.2:
> a[300]
[[1]]
[1] "A" "f" "2" "0" "." "2"
Should read as:
> a[300]
[[1]]
[1] "A" "f" "20" "." "2"
So the steps I need help solving are:
1) Handling the line section of the string when it is over 9.
2) Placing the split values in the day.df dataframe in the four required columns.

Method 1: Using the strsplit() function. strsplit() splits a string based on some condition.
To split a column into multiple columns in R, we can use the str_split_fixed() function of the stringr package, which splits a string into a fixed number of pieces.
Using gsub and strsplit like this:
v <- c('Xm1.1','Xf3.2')
h <- gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])','\\1|\\2|\\3|\\4',v)
do.call(rbind,strsplit(h,'[|]'))
[,1] [,2] [,3] [,4]
[1,] "X" "m" "1" "1"
[2,] "X" "f" "3" "2"
The result is a character matrix; you can cbind it to your original data.frame.
EDIT @GriffinEvo Applied & tested code:
a = gsub('(X|A)(m|f)([0-9]{1,2})[.]([1-4])',
'\\1|\\2|\\3|\\4',day.df$vial)
do.call(rbind, strsplit(a,'[|]') )
day.df = cbind(day.df,do.call(rbind,strsplit(a,'[|]')))
colnames(day.df)[4:7] = c ("treatment" , "gender" , "line" , "block")
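A base R alternative, offered as a sketch rather than as part of the answer above: utils::strcapture (available in R 3.4.0 and later) applies the same kind of capture-group pattern and returns a data frame with named, typed columns directly. The sample vials below, including the two-digit line Af20.2, are made up for illustration, and the ^ and $ anchors are an added assumption that each vial consists of nothing but the four parts.

```r
# Hypothetical sample data; the real day.df has about 31000 rows.
day.df <- data.frame(
  vial = c("Xm1.1", "Am1.1", "Xf3.2", "Af20.2"),
  response = 0,
  explanatory = 4,
  stringsAsFactors = FALSE
)

# Capture groups: treatment, gender, 1-2 digit line, then block after the dot.
pattern <- "^(X|A)(m|f)([0-9]{1,2})[.]([1-4])$"

# proto supplies the names and types of the new columns.
parts <- strcapture(
  pattern, day.df$vial,
  proto = data.frame(treatment = character(), gender = character(),
                     line = integer(), block = integer(),
                     stringsAsFactors = FALSE)
)

# Attach the four new columns to the original data frame.
day.df <- cbind(day.df, parts)
print(day.df)
```

Because line and block come back as integers here, they can be sorted or compared numerically without a further conversion step. Rows whose vial does not match the pattern get NA in all four new columns.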
Read the data:
Lines <- "vial response explanatory
Xm1.1 0 4
Xm2.1 0 4
Xm3.1 0 4
Xm4.1 0 4
"
day.df <- read.table(text = Lines, header = TRUE, as.is = TRUE)
1) Then process it using strapplyc. (We used as.is=TRUE so that day.df$vial is character, but if it is a factor in your data frame then replace day.df$vial with as.character(day.df$vial).) This approach does the parsing in just one short line of code:
library(gsubfn)
s <- strapplyc(day.df$vial, "(.)(.)(\\d+)[.](.)", simplify = rbind)
# we can now cbind it to the original data frame
colnames(s) <- c("treatment", "gender", "line", "block")
cbind(day.df, s)
which gives:
vial response explanatory treatment gender line block
1 Xm1.1 0 4 X m 1 1
2 Xm2.1 0 4 X m 2 1
3 Xm3.1 0 4 X m 3 1
4 Xm4.1 0 4 X m 4 1
2) Here is a different approach. This does not use any packages and is relatively simple (no regular expressions at all) and only involves one R statement including the cbind'ing:
transform(day.df,
treatment = substring(vial, 1, 1), # 1st char
gender = substring(vial, 2, 2), # 2nd char
line = substring(vial, 3, nchar(vial)-2), # 3rd through 2 prior to last char
block = substring(vial, nchar(vial))) # last char
The result is as before.
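To check that the substring approach also copes with two-digit line numbers (the Af20.2 case from the question), here is a small sketch on made-up rows. The as.integer conversion is an optional extra and not part of the answer above; without it, line and block stay as text. It also assumes the block is always a single digit, as the question states (1 to 4).

```r
# Hypothetical rows, including two-digit line numbers.
day.df <- data.frame(
  vial = c("Xm1.1", "Af20.2", "Xf40.4"),
  response = 0,
  explanatory = 4,
  stringsAsFactors = FALSE
)

day.df <- transform(day.df,
  treatment = substring(vial, 1, 1),                              # 1st char
  gender    = substring(vial, 2, 2),                              # 2nd char
  # everything between the gender and the ".block" suffix
  line      = as.integer(substring(vial, 3, nchar(vial) - 2)),
  block     = as.integer(substring(vial, nchar(vial))))           # last char

print(day.df)
```

Since the end position of line is computed from nchar(vial), a one-digit and a two-digit line are both picked up without any special casing.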
UPDATE: Added second approach.
UPDATE: Some simplifications.