
Using the split function to group a dataframe by factor, alternatives for large dataframes

Tags: split, r

I have a question regarding using the split function to group data by factor.

I have a data frame with two columns, snps and gene: snps is a factor and gene is a character vector. I want to group genes by the snp factor so I can see a list of the genes mapping to each snp. Some snps map to more than one gene (for example, rs10000226 maps to genes 345274 and 5783), and a gene can occur multiple times.

To do this I used the split function to make a list of genes each snp maps to.

snps<-c("rs10000185", "rs1000022", "rs10000226", "rs10000226")

gene<-c("5783", "171425", "345274", "5783")

df<-data.frame(snps, gene, stringsAsFactors=TRUE)  # snps is a factor (stringsAsFactors is FALSE by default since R 4.0)

df$gene<-as.character(df$gene)

splitted <- split(df, df$snps, drop=TRUE) # group rows by snp

df.2 <- lapply(splitted, function(x) { x["snps"] <- NULL; x })   # remove the snp column

df.2 <- sapply(df.2, function(x) list(as.character(x$gene)))   # list elements are already named by snp

save(df.2, file="df.2.rda")

However, this does not work well for my full data frame (probably because of its size: 363422 rows, 281370 unique snps, 20888 unique genes), and R crashes while trying to load df.2.rda later on.

Any suggestions for alternative ways to do this would be much appreciated!

avari asked Oct 31 '22 05:10


1 Answer

There is a shorter way to create your df.2:

genes_by_snp <- split(df$gene, df$snps)

You can look at the genes for a given snp with genes_by_snp[["rs10000226"]].
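As a minimal sketch using the toy data from the question, the one-line split gives a named list with one character vector per snp. The `unique` step and the `lengths` call are additions for illustration and are not part of the approach above:

```r
# Toy data from the question; stringsAsFactors = TRUE makes snps a factor
snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
df <- data.frame(snps, gene, stringsAsFactors = TRUE)
df$gene <- as.character(df$gene)

# One character vector of genes per snp, named by snp
genes_by_snp <- split(df$gene, df$snps)

# Optional (an assumption about what is wanted): drop repeated genes within each snp
genes_by_snp <- lapply(genes_by_snp, unique)

genes_by_snp[["rs10000226"]]   # genes mapped to a single snp
lengths(genes_by_snp)          # number of distinct genes per snp
```

Because only the gene vector is split, not the whole data frame, no per-snp data frames are created along the way.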


Your data set does not sound so big to me, but you could avoid creating the list above by storing your original data differently. Expanding on @AnandoMahto's comment, here's how to use the data.table package:

require(data.table)

setDT(df)
setkey(df,snps)

You can look at the genes for a given snp with df[J("rs10000226")].
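A small sketch of the keyed lookup on the question's toy data, assuming the data.table package is installed. The final `by =` aggregation is an addition for illustration, in case a per-snp list is still wanted:

```r
library(data.table)  # assumes data.table is installed

# Toy data from the question, kept as plain character columns here
snps <- c("rs10000185", "rs1000022", "rs10000226", "rs10000226")
gene <- c("5783", "171425", "345274", "5783")
df <- data.frame(snps, gene, stringsAsFactors = FALSE)

setDT(df)         # convert to a data.table by reference (no copy)
setkey(df, snps)  # sort by snps and mark it as the key

# Keyed lookup: all rows for one snp
df[J("rs10000226")]

# Only the gene column for that snp, as a character vector
genes_one_snp <- df[J("rs10000226"), gene]

# Illustrative addition: one row per snp with its distinct genes in a list column
genes_by_snp_dt <- df[, .(genes = list(unique(gene))), by = snps]
```

Keeping the data in one keyed table avoids building and saving a list with hundreds of thousands of elements. Lookups are done on demand instead.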

Frank answered Nov 15 '22 05:11