When working with genomic array data, a 'probe' is often assigned to several genes (or several transcripts). The object df shows an example of this.
df <- data.frame(c("geneA;geneB;geneB", "geneG", "geneC;geneD"))
colnames(df) <- "gene.names"
df  # looks like this:
         gene.names
1 geneA;geneB;geneB
2             geneG
3       geneC;geneD
I would like to split all elements in df$gene.names at ; and put each substring in a new column. NA can be used if there are no more genes in a row.
This script works, but I think most people will agree that this is greedy code and not very efficient. Can someone suggest a better alternative?
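library(plyr)  # load this library first; it provides rbind.fill()
out <- NULL
for (i in seq_len(nrow(df))) {
  # split row i at ";" and turn the pieces into a one-row data frame
  one <- as.data.frame(t(as.data.frame(strsplit(as.character(df[i, 1]), ";"))))
  # rbind.fill() pads the missing columns with NA
  out <- rbind.fill(out, one)
}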
out  # looks like this:
     V1    V2    V3
1 geneA geneB geneB
2 geneG  <NA>  <NA>
3 geneC geneD  <NA>
I recommend using splitstackshape for this:
splitstackshape::cSplit(df, splitCols="gene.names", sep=";")
   gene.names_1 gene.names_2 gene.names_3
1:        geneA        geneB        geneB
2:        geneG           NA           NA
3:        geneC        geneD           NA
Here is a base R option with read.table:
read.table(text = as.character(df$gene.names), sep = ";",
           header = FALSE, stringsAsFactors = FALSE, fill = TRUE, na.strings = "")
#     V1    V2    V3
#1 geneA geneB geneB
#2 geneG  <NA>  <NA>
#3 geneC geneD  <NA>
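Another base R route, sketched here as an alternative rather than as part of either answer above, is to call strsplit() once on the whole column and pad each result with NA up to the longest length. This avoids the row-by-row loop, and it does not rely on read.table working out the number of columns, which with fill = TRUE it determines from the first five lines of input. The names parts, n, padded and out are illustrative only.

```r
# Example data from the question
df <- data.frame(gene.names = c("geneA;geneB;geneB", "geneG", "geneC;geneD"))

# Split every entry at ";" in one vectorised call
# (fixed = TRUE treats the separator as a literal string, not a regex)
parts <- strsplit(as.character(df$gene.names), ";", fixed = TRUE)

# Length of the longest split, i.e. the number of columns needed
n <- max(lengths(parts))

# Extending a vector's length pads it with NA
padded <- lapply(parts, function(x) {
  length(x) <- n
  x
})

# Stack the padded vectors as rows and convert to a data frame
out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
out
```

The result should have one row per probe and one character column per gene position, with NA where a row has fewer genes than the longest one.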