Split string in each column for several columns

Tags:

r

strsplit

I have this table (data1) with four columns

SNP rs6576700 rs17054099 rs7730126
sample1 G-G T-T G-G
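
For anyone who wants to reproduce the problem, the table above could be rebuilt roughly as follows. This is only a sketch: the original post does not show how data1 was read in, so the column types (plain character vectors) are an assumption.

```r
# Hypothetical reconstruction of the example table from the question.
# stringsAsFactors = FALSE keeps the genotype columns as character vectors
# (an assumption; the original import step is not shown).
data1 <- data.frame(
  SNP        = "sample1",
  rs6576700  = "G-G",
  rs17054099 = "T-T",
  rs7730126  = "G-G",
  stringsAsFactors = FALSE
)

str(data1)
```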

I need to separate columns 2-4 into two columns each, so that the new output has 7 columns, like this:

SNP rs6576700 rs6576700 rs17054099 rs17054099 rs7730126 rs7730126
sample1 G G T T G G

With the following function I can split all columns at once, but the output is not what I need.

split_alleles <- function(x) {
  # split each "A-B" genotype string into its two alleles
  strsplit(as.character(x), split = "-")
}

data2 <- apply(data1[, -1], 2, split_alleles)

data2
$rs17054099
$rs17054099[[1]]
[1] "T" "T"


$rs7730126
$rs7730126[[1]]
[1] "G" "G"


$rs6576700
$rs6576700[[1]]
[1] "C" "C"

On Stack Overflow I found a method to convert the output of strsplit to a data frame, but the rs numbers end up in rows rather than in columns. (I got similar output with the other methods in the thread "strsplit by row and distribute results by column in data.frame".)

> n <- max(sapply(data2, length))
> l <- lapply(data2, function(X) c(X, rep(NA, n - length(X))))
> data.frame(t(do.call(cbind, l)))
           t.do.call.cbind..l..
rs17054099                 T, T
rs7730126                  G, G
rs2061700                  C, C

If I do not transpose the result (the t(do.call(...)) step), the output is a list that I cannot write to a file.

I would like a solution in R so that it can be part of a pipeline.

I forgot to say that I need to apply this to a million columns.

317

asked Aug 13 '15 21:08

Sami

1 Answer

This is straightforward using the splitstackshape::cSplit function. Just pass the column indices to the splitCols parameter and the separator to the sep parameter, and you're done. It will even number the new column names so that you can tell them apart. I've specified type.convert = FALSE so that T values won't be converted to TRUE. The default direction is wide, so you don't need to specify it.

library(splitstackshape)
cSplit(data1, 2:4, sep = "-", type.convert = FALSE)
#        SNP rs6576700_1 rs6576700_2 rs17054099_1 rs17054099_2 rs7730126_1 rs7730126_2
# 1: sample1           G           G            T            T           G           G

Here's a solution as per the provided link, using the tstrsplit function, which at the time of writing required the development version of data.table (v1.9.5+) on GitHub. Here we first define the index by subsetting the column names, and then number the new columns using paste0. This is a slightly more cumbersome approach, but its advantage is that it updates your original data in place instead of creating a copy of the whole data set.

library(data.table) ## V1.9.5+
indx <- names(data1)[2:4]
setDT(data1)[, paste0(rep(indx, each = 2), 1:2) := sapply(.SD, tstrsplit, "-"), .SDcols = indx]
data1
#        SNP rs6576700 rs17054099 rs7730126 rs65767001 rs65767002 rs170540991 rs170540992 rs77301261 rs77301262
# 1: sample1       G-G        T-T       G-G          G          G           T           T          G          G
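
If installing extra packages is not an option, the same reshaping can be sketched in base R. This is not part of the original answer: split_cols is a hypothetical helper, the data1 definition is a reconstruction of the example table, and the code assumes every value in a column splits into the same number of pieces. It has not been benchmarked on a million columns, where the packages above are likely to be faster.

```r
# Reconstruction of the example table (an assumption; see the question).
data1 <- data.frame(
  SNP = "sample1", rs6576700 = "G-G", rs17054099 = "T-T", rs7730126 = "G-G",
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each column named in `cols` on `sep` and
# return the untouched columns followed by the split pieces.
split_cols <- function(df, cols, sep = "-") {
  pieces <- lapply(cols, function(nm) {
    # strsplit() gives one character vector per row; rbind them into a
    # matrix with one row per sample and one column per allele.
    m <- do.call(rbind, strsplit(as.character(df[[nm]]), sep, fixed = TRUE))
    colnames(m) <- paste(nm, seq_len(ncol(m)), sep = "_")
    m
  })
  data.frame(
    df[setdiff(names(df), cols)],   # columns that are not split (here: SNP)
    do.call(cbind, pieces),         # all split pieces, in original column order
    stringsAsFactors = FALSE,
    check.names = FALSE
  )
}

data2 <- split_cols(data1, names(data1)[2:4])
data2
```

The result is an ordinary data frame, so it can be written out with write.table(data2, ...) like any other.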

111

answered Oct 23 '22 15:10

David Arenburg
