I have this table (data1) with four columns
SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G
I need to split each of columns 2-4 into two columns, so that the new output has 7 columns. Like this:
SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T G G
With the following function I could split all the columns at once, but the output is not what I need.
split <- function(x){
x <- as.character(x)
strsplit(as.character(x), split="-")
}
data2=apply(data1[,-1], 2, split)
data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"
$rs7730126
$rs7730126[[1]]
[1] "G" "G"
$rs6576700
$rs6576700[[1]]
[1] "C" "C"
On Stack Overflow I found a method to convert the output of strsplit to a data frame, but the rs numbers end up in rows, not in columns (I got similar output with other methods from the thread "strsplit by row and distribute results by column in data.frame").
> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
t.do.call.cbind..l..
rs17054099 T, T
rs7730126 G, G
rs2061700 C, C
If I do not use the function transpose (...(t(do.call...), the output is a list that I cannot write to a file.
I would like to have the solution in R to make it part of a pipeline.
I forgot to say that I need to apply this to a million columns.
To split a column into multiple columns in R, we can use the str_split_fixed() function from the stringr package. str_split_fixed() splits each string into a fixed number of pieces and returns them as a character matrix.
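As a rough sketch of how str_split_fixed() could be applied to the table in the question: the toy data1 below is rebuilt from the example row, and the names snp_cols, pieces, alleles and the _1/_2 suffixes are my own choices for illustration, not something from the original post.

```r
library(stringr)

# Toy data mirroring the question (one sample, three SNP columns)
data1 <- data.frame(
  SNP        = "sample1",
  rs6576700  = "G-G",
  rs17054099 = "T-T",
  rs7730126  = "G-G",
  stringsAsFactors = FALSE
)

# Every column except the sample ID holds an "allele-allele" string
snp_cols <- names(data1)[-1]

# str_split_fixed() returns a character matrix with exactly n columns
# for each input vector, so every SNP yields a two-column matrix
pieces <- lapply(data1[snp_cols], str_split_fixed, pattern = "-", n = 2)

# Bind the per-SNP matrices side by side and label the allele columns
alleles <- do.call(cbind, pieces)
colnames(alleles) <- paste0(rep(snp_cols, each = 2), c("_1", "_2"))

# Reattach the sample IDs; the result is a plain data frame with 7 columns
data2 <- data.frame(SNP = data1$SNP, alleles, stringsAsFactors = FALSE)
```

Because the result is an ordinary data frame, it can be passed to write.table() directly. I have not tried this on anything close to a million columns, so treat it as a starting point rather than a tuned solution.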
This is straightforward using the splitstackshape::cSplit function. Just specify the column indices in the splitCols parameter and the separator in the sep parameter, and you're done. It will even number your new column names so you can distinguish between them. I've specified type.convert = FALSE so that T values won't become TRUE. The default direction is wide, so you don't need to specify it.
library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
# SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1 G G T T G G
Here's a solution, as per the provided link, using the tstrsplit function from the devel version of data.table on GH. Here, we first define the index by subsetting the column names, and then number the new columns using paste0. This is a slightly more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data set.
library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
# SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1 G-G T-T G-G G G T T G G
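Since the question mentions roughly a million columns, a base R variant that works on one character matrix instead of a list per column may also be worth trying. This is only a sketch under the assumption that every cell contains exactly one "-"; the toy data1 is rebuilt from the question, and the names m, first, second and out are made up for the example.

```r
# Toy data mirroring the question (one sample, three SNP columns)
data1 <- data.frame(
  SNP        = "sample1",
  rs6576700  = "G-G",
  rs17054099 = "T-T",
  rs7730126  = "G-G",
  stringsAsFactors = FALSE
)

# Work on the genotype columns as a single character matrix
m <- as.matrix(data1[-1])

# Vectorised split: keep the part before / after the "-" in every cell.
# Assigning with [] keeps the matrix dimensions and column names.
first  <- m
second <- m
first[]  <- sub("-.*$", "", m)
second[] <- sub("^.*-", "", m)

# Interleave the two alleles: odd columns get the first, even the second
out <- matrix(NA_character_, nrow = nrow(m), ncol = 2 * ncol(m))
out[, seq(1, ncol(out), by = 2)] <- first
out[, seq(2, ncol(out), by = 2)] <- second

# Repeat each rs number twice, as in the desired output
colnames(out) <- rep(colnames(m), each = 2)

# Character matrix with the sample ID in front; 7 columns in total
result <- cbind(SNP = data1$SNP, out)
```

The result is a plain character matrix, so it can be written out with write.table(result, ..., quote = FALSE, row.names = FALSE). The duplicated column names match the layout asked for in the question, but they would be made unique if the matrix were converted to a data frame with the default check.names = TRUE.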