
Using parallel processing in vegan functions?

Tags: r, vegan

I am interested in executing the R function adonis from the vegan package in parallel. However, it isn't clear to me how exactly to make it run in parallel. Regardless of how I try to initialize it, it seems to take the same amount of time to execute. Can someone explain what I am doing wrong?

require(vegan)
require(parallel)
data(dune)
data(dune.env)
#This:
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
#Runs faster (4.49 s) than this (6.7 s):
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
#or this (6.7 s)
cl <- makeCluster(3)
system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=cl))
stopCluster(cl)

Computer details:

  • R V4.0
  • Win 10x64
  • i5-8350 4 cores
TBP asked Mar 06 '26 10:03



1 Answer

I'm not sure how helpful this answer will really be, but I'll share a few of my own observations and things I've slowly pieced together. I don't pretend to be an expert on this, so take my answer realizing there may be some inaccuracies in here. I'm a biologist first.

Some of these parallel libraries seem to reload the R environment and run any startup files (e.g. .Rprofile) you have for each core. So there is an inherent time cost to using the parallel libraries, and you will only see a benefit from the parallel functions if the computation is large enough to be worth parallelizing. In your example, the dune dataset is really small (I'll share my own benchmarks below). That said, there are a few things that seem to help.
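To see that fixed cost on your own machine, here is a minimal sketch using only the base parallel package. The worker count of 2 and the toy function tiny_job are my own placeholders, not anything from vegan; the point is only to compare the time to start workers against the time for a job that does almost no work.

```r
library(parallel)

# Time how long it takes just to start and stop two worker processes.
startup <- system.time({
  cl <- makeCluster(2)
  stopCluster(cl)
})[["elapsed"]]

# A deliberately tiny job (hypothetical stand-in): the work itself is near-instant.
tiny_job <- function(i) sqrt(i)

# Run the job serially.
serial <- system.time(res_serial <- lapply(1:100, tiny_job))[["elapsed"]]

# Run the same job on an already-started cluster.
cl <- makeCluster(2)
par_time <- system.time(res_par <- parLapply(cl, 1:100, tiny_job))[["elapsed"]]
stopCluster(cl)

# Elapsed seconds for each step; both approaches return the same results.
print(c(startup = startup, serial = serial, parallel = par_time))
```

If the startup figure is already in the range of your whole serial run, a job that small is unlikely to come out ahead in parallel.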

With the doParallel library (which loads the parallel package, where makeCluster() comes from), you can pass arguments so that the workers don't load unnecessary startup files, like so:

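library(doParallel)
library(phyloseq)  # assumed source of UniFrac(); d is a phyloseq object defined elsewhere
# Start 3 workers, skipping the user and site startup files and the environment files
cl <- makeCluster(3, rscript_args = c("--no-init-file", "--no-site-file", "--no-environ"))
# for linux   .... cl <- makePSOCKcluster(2)
registerDoParallel(cl)
unif_w <- UniFrac(d, weighted = TRUE, parallel = TRUE, normalized = TRUE)
unif_uw <- UniFrac(d, weighted = FALSE, parallel = TRUE)
stopCluster(cl)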

I noticed in my own work that adding the rscript_args option greatly improved my speeds (sorry, no benchmarks for this; I'm hoping to get a quick answer out). If I remember where I got that suggestion from, I'll come back and share the source.
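If you want to check this on your own setup, a rough sketch like the following times cluster startup with and without those flags. The helper time_startup is a hypothetical name of mine, and the worker count of 2 is arbitrary; how much difference you see will depend on what your startup files actually do.

```r
library(parallel)

# Hypothetical helper: elapsed seconds to start and stop a 2-worker cluster.
# Extra arguments are passed through to makeCluster().
time_startup <- function(...) {
  system.time({
    cl <- makeCluster(2, ...)
    stopCluster(cl)
  })[["elapsed"]]
}

# Workers started with the default options.
default_time <- time_startup()

# Workers started without the user/site startup files and environment files.
lean_time <- time_startup(
  rscript_args = c("--no-init-file", "--no-site-file", "--no-environ")
)

print(c(default = default_time, lean = lean_time))
```

A single run is noisy, so repeat it a few times before drawing conclusions.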

This doesn't help with running adonis. However, I think that initial time cost might explain why we don't see a time benefit from the parallel option built into adonis on the dune dataset. Here are my benchmarks.

> data("dune")
> data("dune.env")
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999))
   user  system elapsed 
   3.90    0.00    3.93 
> # The same call with parallel = 3:
> system.time(adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel=3))
   user  system elapsed 
   0.71    0.04    6.53 

Not a big difference on this set, but it IS slower in parallel. However, here it is repeated with a large set I'm working with at the moment (bc is a distance matrix calculated from a species matrix of 887 species by 3734 sites):

> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 1))
   user  system elapsed 
 109.95   21.27  131.22 
> system.time(adonis(bc ~ fmet$Diagnosis, parallel = 4))
   user  system elapsed 
   3.44    1.41   82.36 

Long story short, in this specific case you might only benefit from the adonis parallel option on a larger dataset.
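To check where that break-even point is for your own data, a small wrapper like this can run the same call serially and in parallel and report the elapsed times side by side. compare_elapsed and its arguments are hypothetical names of mine, and the Sys.sleep stand-in is only there so the sketch runs without vegan installed.

```r
# Hypothetical helper: 'run' is a function taking the value to pass as 'parallel'.
# Returns elapsed seconds for a serial run (1) and a parallel run ('cores').
compare_elapsed <- function(run, cores = 2) {
  c(
    serial   = system.time(run(1))[["elapsed"]],
    parallel = system.time(run(cores))[["elapsed"]]
  )
}

# Stand-in job so the sketch runs on its own; it ignores the 'p' argument.
timings <- compare_elapsed(function(p) Sys.sleep(0.05))
print(timings)

# With vegan loaded and the data from the question, the call would look like:
# compare_elapsed(
#   function(p) adonis(dune ~ Management * A1, dune.env, perm = 99999, parallel = p),
#   cores = 3
# )
```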

I'm not sure how important computer specs are here, but I do have a large bit of memory intended for this kind of purpose. The memory in my case is more important for allowing me to work with large matrices a little easier.

  • R version: 4.0.2
  • Windows 10, 64bit
  • AMD Ryzen 3600
  • 64 GB DRAM

Anyways, I'm still looking for other work-arounds and tricks.

cdtip answered Mar 08 '26 00:03



