 

Using multicore in R to analyse GWAS data

I am using R to analyze genome-wide association study data. I have about 500,000 potential predictor variables (single-nucleotide polymorphisms, or SNPs) and want to test the association between each of them and a continuous outcome (in this case low-density lipoprotein concentration in the blood).

I have already written a script that does this without problems. To briefly explain, I have a data object called "Data". Each row corresponds to a particular patient in the study. There are columns for age, gender, body mass index (BMI), and blood LDL concentration, plus half a million further columns with the SNP data.

I am currently using a for loop to run the linear model half a million times, as shown:

# Pre-allocate a matrix to hold the p value and estimate for each SNP
results <- matrix(NA, nrow = 500000, ncol = 2)

# Repeat the loop half a million times (the SNP data are assumed to be
# in the first 500,000 columns of "Data")
for (i in 1:500000) {

  # Select the appropriate SNP as a vector
  SNP <- Data[[i]]

  # For each iteration, perform linear regression adjusted for age, gender,
  # and BMI and save the fit in an object called "GenoMod"
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)

  # For each model, save the p value and effect estimate for the SNP term
  # in columns 1 and 2 of the "results" matrix
  results[i, 1] <- summary(GenoMod)$coefficients["SNP", "Pr(>|t|)"]
  results[i, 2] <- summary(GenoMod)$coefficients["SNP", "Estimate"]
}

All of that works fine. However, I would really like to speed up the analysis, so I've been experimenting with the multicore, doMC, and foreach packages.

My question is, could someone please help me adapt this code using the foreach scheme?

I am running the script on a Linux server that apparently has 16 cores available. I've tried experimenting with the foreach package, but my results with it have been worse: the analysis takes longer to run with foreach than with the plain for loop.

For example, I've tried saving the linear model objects as shown:

library(doMC)
registerDoMC()
results <- foreach(i = 1:500000) %dopar% {
  SNP <- Data[[i]]
  lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
}

This takes more than twice as long as the plain for loop. Any advice on how to do this better or faster would be appreciated! I understand that a parallel version of lapply might be an option, but I don't know how to do that either.
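In case it helps, this is the direction I was trying to take the foreach version, returning just the two coefficients per SNP rather than the whole model object. The .combine argument is my guess at how to stack the results, and I don't know whether this is the right way to structure it:

library(doMC)
registerDoMC(16)

results <- foreach(i = 1:500000, .combine = rbind) %dopar% {
  SNP <- Data[[i]]
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  # Return the p value and estimate for the SNP term; foreach stacks
  # these row by row via .combine = rbind
  c(summary(GenoMod)$coefficients["SNP", "Pr(>|t|)"],
    summary(GenoMod)$coefficients["SNP", "Estimate"])
}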

All the best,

Alex


1 Answer

To give you a starting point: since you are on Linux, you can use the multicore approach provided by the parallel package. Whereas you had to register a backend when using e.g. the foreach package, that is not necessary with this approach. Your code runs on 16 cores simply by doing:

require(parallel)

mylm <- function(i) {
  # Select the i-th SNP as a vector
  SNP <- Data[[i]]
  GenoMod <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  # Return the p value and estimate for the SNP term
  coefs <- summary(GenoMod)$coefficients
  c(coefs["SNP", "Pr(>|t|)"], coefs["SNP", "Estimate"])
}

Out <- mclapply(1:500000, mylm, mc.cores = 16)  # returns a list
Result <- do.call(rbind, Out)                   # bind the list into a matrix

Here you write a function that returns a vector with the wanted quantities and apply it over the indices with mclapply. I couldn't check this, as I don't have access to the data, but it should work.
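If you'd rather not hard-code the number of cores, something along these lines should work; detectCores() is also in the parallel package, and leaving one core free is just a habit, not a requirement:

# Pick the core count dynamically instead of hard-coding 16
n.cores <- max(1, parallel::detectCores() - 1)

Out <- mclapply(1:500000, mylm, mc.cores = n.cores)
Result <- do.call(rbind, Out)
colnames(Result) <- c("p.value", "estimate")  # label the two result columns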


Joris Meys