I have a question regarding using the split function to group data by factor.
I have a data frame with two columns, snps and gene. snps is a factor and gene is a character vector. I want to group genes by the snp factor so I can see a list of genes mapping to each snp. Some snps map to more than one gene (for example, rs10000226 maps to genes 345274 and 5783), and a gene can occur multiple times.
To do this I used the split function to make a list of genes each snp maps to.
snps<-c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene<-c("5783", "171425", "345274", "5783")
df<-data.frame(snps, gene, stringsAsFactors=TRUE) # snps is a factor (explicit, since R >= 4.0 no longer converts strings by default)
df$gene<-as.character(df$gene)
splitted=split(df, df$snps, drop=TRUE) # group rows by snp; the list names are the snp IDs
df.2<-lapply(splitted, function(x) as.character(x$gene)) # keep only the genes for each snp
save(df.2, file="df.2.rda")
However, this does not work well for my full data frame (probably due to its size: 363422 rows, 281370 unique snps, 20888 unique genes), and R crashes when I try to load df.2.rda later on.
Any suggestions for alternative ways to do this would be much appreciated!
There is a shorter way to create your df.2:
genes_by_snp <- split(df$gene, df$snps)
You can look at the genes for a given snp with genes_by_snp[["rs10000226"]].
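As a minimal sketch on the toy data from the question, here is the same split() call end to end. The names n_genes and multi_gene_snps are only illustrative, and the unique() step is an optional extra that assumes you want each gene listed once per snp.

```r
# Toy data from the question; stringsAsFactors = TRUE makes snps a factor
snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
df <- data.frame(snps, gene, stringsAsFactors = TRUE)
df$gene <- as.character(df$gene)

# One list element per snp, each holding a character vector of genes
genes_by_snp <- split(df$gene, df$snps)

# Optional: drop repeated genes within a snp
genes_by_snp <- lapply(genes_by_snp, unique)

# Number of genes per snp, and the snps that map to more than one gene
n_genes <- lengths(genes_by_snp)
multi_gene_snps <- names(n_genes)[n_genes > 1]
```

Because split() works on the two vectors directly, it never builds one small data frame per snp, which is probably where most of the overhead in the original approach comes from.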
Your data set does not sound so big to me, but you could avoid creating the list above by storing your original data differently. Expanding on @AnandoMahto's comment, here's how to use the data.table package:
require(data.table)
setDT(df)
setkey(df,snps)
You can look at the genes for a given snp with df[J("rs10000226")].
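A small sketch of the keyed approach on the toy data, assuming the data.table package is installed. The names dt, genes_one and genes_per_snp are made up for illustration, and the grouped summary at the end is an optional extra rather than something you need for lookups.

```r
library(data.table)

# Toy data from the question, built directly as a data.table
snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
dt <- data.table(snps, gene)  # both columns are character here

# Sort by snps and mark it as the key, enabling fast keyed lookups
setkey(dt, snps)

# Genes for one snp, returned as a plain character vector
genes_one <- dt[.("rs10000226"), gene]

# Optional: one row per snp, with its genes in a list column and a count
genes_per_snp <- dt[, .(genes = list(unique(gene)), n_genes = uniqueN(gene)),
                    by = snps]
```

This keeps everything in one table, so there is no 281370-element list to save and reload; you can store dt itself and query it by key when needed.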