Split string in each column for several columns

Tags: r, strsplit

I have this table (data1) with four columns:

SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G

I need to split each of columns 2-4 into two columns, so that the output has 7 columns, like this:

SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T G G

With the following function I can split all columns at once, but the output is not what I need.

split <- function(x) {
  strsplit(as.character(x), split = "-")
}

data2 <- apply(data1[, -1], 2, split)

data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"


$rs7730126
$rs7730126[[1]]
[1] "G" "G"


$rs6576700
$rs6576700[[1]]
[1] "C" "C"

On Stack Overflow I found a method to convert the output of strsplit to a data frame, but the rs numbers end up in rows, not in columns. (I got similar output with the other methods in the thread "strsplit by row and distribute results by column in data.frame".)

> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
           t.do.call.cbind..l..
rs17054099                 T, T
rs7730126                  G, G
rs2061700                  C, C

If I do not transpose the result (the t() call in t(do.call(...))), the output is a list that I cannot write to a file.

I would like to have the solution in R to make it part of a pipeline.

I forgot to say that I need to apply this to a million columns.

Sami asked Aug 13 '15 21:08



1 Answer

This is straightforward using the splitstackshape::cSplit function. Just specify the column indices in the splitCols parameter and the separator in the sep parameter, and you're done. It will even number the new column names so you can tell them apart. I've specified type.convert = FALSE so that T values won't become TRUE. The default direction is wide, so you don't need to specify it.

library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
#        SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1           G           G            T            T           G           G
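If adding a package is not an option, the same wide split can be sketched in base R. This is only an illustration, not how cSplit works internally: the helper name split_pair and the second sample row are made up, and the code assumes every cell contains exactly one "-".

```r
# Toy input mirroring the question's table (sample2 is invented for illustration)
data1 <- data.frame(
  SNP        = c("sample1", "sample2"),
  rs6576700  = c("G-G", "A-G"),
  rs17054099 = c("T-T", "C-T"),
  rs7730126  = c("G-G", "G-C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one column of "X-Y" strings into a
# two-column character matrix (assumes exactly one "-" per cell)
split_pair <- function(x) {
  parts <- strsplit(as.character(x), "-", fixed = TRUE)
  do.call(rbind, parts)
}

cols   <- names(data1)[-1]
pieces <- lapply(data1[cols], split_pair)

# Bind the per-column matrices side by side and name them <rs>_1, <rs>_2
wide <- do.call(cbind, pieces)
colnames(wide) <- paste0(rep(cols, each = 2), "_", 1:2)

result <- data.frame(SNP = data1$SNP, wide, stringsAsFactors = FALSE)
```

The result is an ordinary data frame with character columns, so it can be passed to write.table directly. With a very large number of columns, the package-based approaches are likely to be considerably faster.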

Here's a solution as per the provided link, using the tstrsplit function from the devel version of data.table on GH. Here, we first define the index by subsetting the column names, and then number the new columns using paste0. This is a somewhat more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data set.

library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
#        SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1       G-G        T-T       G-G          G          G           T           T          G          G
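As the output shows, the original unsplit columns stay in the table. A possible variant, sketched below, flattens the tstrsplit results with lapply and unlist(recursive = FALSE) and then drops the originals by reference. It assumes a data.table version that provides tstrsplit (1.9.6 or later on CRAN); the table name dt, the "_" in the new column names, and the second sample row are choices made for this illustration.

```r
library(data.table)

# Toy input mirroring the question's table (sample2 is invented for illustration)
dt <- data.table(
  SNP        = c("sample1", "sample2"),
  rs6576700  = c("G-G", "A-G"),
  rs17054099 = c("T-T", "C-T"),
  rs7730126  = c("G-G", "G-C")
)

indx      <- names(dt)[2:4]
new_names <- paste0(rep(indx, each = 2), "_", 1:2)

# tstrsplit returns one list element per piece; flatten the per-column
# lists into a single list of six vectors, in the same order as new_names
dt[, (new_names) := unlist(lapply(.SD, tstrsplit, "-", fixed = TRUE),
                           recursive = FALSE),
   .SDcols = indx]

# Remove the original unsplit columns by reference
dt[, (indx) := NULL]
```

After this, dt holds SNP followed by the six split columns, which matches the 7-column layout asked for in the question.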
David Arenburg answered Oct 23 '22 15:10