 

Whether to use the detectCores function in R to specify the number of cores for parallel processing?


In the help for detectCores() it says:

This is not suitable for use directly for the mc.cores argument of mclapply nor specifying the number of cores in makeCluster. First because it may return NA, and second because it does not give the number of allowed cores.

However, I've seen quite a bit of sample code like the following:

library(parallel)

k <- 1000
m <- lapply(1:7, function(X) matrix(rnorm(k^2), nrow = k))

cl <- makeCluster(detectCores() - 1, type = "FORK")
test <- parLapply(cl, m, solve)
stopCluster(cl)

where detectCores() is used to specify the number of cores in makeCluster.

My use cases involve running parallel processing both on my own multicore laptop (OSX) and running it on various multicore servers (Linux). So, I wasn't sure whether there is a better way to specify the number of cores or whether perhaps that advice about not using detectCores was more for package developers where code is meant to run over a wide range of hardware and OS environments.
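For concreteness, the kind of defensive pattern I've seen suggested looks like the sketch below; the NA guard follows the help-page warning, and leaving one core free is just a convention, not a requirement:

library(parallel)

nc <- detectCores()          # may return NA on some platforms
if (is.na(nc)) nc <- 1L
workers <- max(1L, nc - 1L)  # leave one core free for the OS / other work
cl <- makeCluster(workers)
# ... parLapply(cl, ...) ...
stopCluster(cl)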

So in summary:

  • Should you use the detectCores function in R to specify the number of cores for parallel processing?
  • What does the distinction between detected and allowed cores mean, and when is it relevant?
asked Mar 10 '15 by Jeromy Anglim


People also ask

How do I enable parallel processing in R?

The most straightforward way to enable parallel processing is by switching from using lapply to mclapply. (Note I'm using system.time instead of profvis here because I only care about running time, not profiling.)
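As a minimal sketch of that switch (the slow_sqrt toy function is mine, purely to make the timing difference visible):

library(parallel)

slow_sqrt <- function(x) { Sys.sleep(0.5); sqrt(x) }

system.time(lapply(1:8, slow_sqrt))                  # sequential: roughly 4 s
system.time(mclapply(1:8, slow_sqrt, mc.cores = 4))  # roughly 1 s on macOS/Linux
# Note: mc.cores > 1 requires a unix-alike; on Windows mclapply
# only supports mc.cores = 1.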

How do I use multiple cores in R?

If you are on a single host, a very effective way to make use of these extra cores is to run several R instances at the same time; the operating system will schedule each new R instance on its own core where one is available. On Linux, just open several terminal windows, then type R in each to start a session.
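A scripted way to get the same effect from a single session is a "PSOCK" cluster, which launches separate R processes, one per worker; a sketch:

library(parallel)

# makeCluster() with the default "PSOCK" type starts separate R processes,
# effectively a scripted version of opening several terminals:
cl <- makeCluster(4)
parSapply(cl, 1:4, function(i) Sys.getpid())  # typically four distinct PIDs
stopCluster(cl)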

Does Lapply run in parallel?

The parallel library, which comes with R as of version 2.14.0, provides the mclapply() function, which is a drop-in replacement for lapply. The "mc" stands for "multicore," and as you might gather, this function distributes the lapply tasks across multiple CPU cores to be executed in parallel.
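The "drop-in" claim is easy to check: mclapply returns an ordered list exactly like lapply, so only the call needs to change. A minimal sketch:

library(parallel)

res_serial   <- lapply(1:100, function(i) i^2)
res_parallel <- mclapply(1:100, function(i) i^2, mc.cores = 2)
identical(res_serial, res_parallel)  # TRUE: same values, same order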


2 Answers

I think it's perfectly reasonable to use detectCores as a starting point for the number of workers/processes when calling mclapply or makeCluster. However, there are many reasons that you may want or need to start fewer workers, and even some cases where you can reasonably start more.

On some hyperthreaded machines, for example, it may not be a good idea to set mc.cores=detectCores(), because detectCores() counts logical CPUs, which can overstate the number of cores that are actually useful for compute-bound work. Or if your script is running on an HPC cluster, you shouldn't use any more resources than the job scheduler has allocated to your job. You also have to be careful in nested parallel situations, as when your code may be executed in parallel by a calling function, or you're executing a multithreaded function in parallel. In general, it's a good idea to run some preliminary benchmarks before starting a long job to determine the best number of workers; a sketch of such a benchmark follows below. I usually monitor the benchmark with top to see if the number of processes and threads makes sense, and to verify that the memory usage is reasonable.
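Here's roughly what such a preliminary benchmark could look like; the task function and the worker counts are placeholders for your real workload, and mc.cores > 1 assumes a unix-alike:

library(parallel)

task <- function(i) { Sys.sleep(0.2); i }  # stand-in for the real workload

for (w in c(1, 2, 4, 8)) {
  elapsed <- system.time(mclapply(1:16, task, mc.cores = w))["elapsed"]
  cat(sprintf("%2d workers: %5.2f s\n", w, elapsed))
}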

The advice that you quoted is particularly appropriate for package developers. It's certainly a bad idea for a package developer to always start detectCores() workers when calling mclapply or makeCluster, so it's best to leave the decision up to the end user. At least the package should allow the user to specify the number of workers to start, but arguably detectCores() isn't even a good default value. That's why the default value for mc.cores changed from detectCores() to getOption("mc.cores", 2L) when mclapply was included in the parallel package.
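To make that concrete, here's a sketch of the package-friendly pattern; my_fun is a hypothetical package function, not from the original answer:

# Expose the worker count as an argument and default to the
# user-controllable option rather than detectCores().
my_fun <- function(x, mc.cores = getOption("mc.cores", 2L)) {
  parallel::mclapply(x, sqrt, mc.cores = mc.cores)
}

options(mc.cores = 4)  # the end user, not the package, decides
str(my_fun(1:8))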

I think the real point of the warning that you quoted is that R functions should not assume that they own the whole machine, or that they are the only function in your script that is using multiple cores. If you call mclapply with mc.cores=detectCores() in a package that you submit to CRAN, I expect your package will be rejected until you change it. But if you're the end user, running a parallel script on your own machine, then it's up to you to decide how many cores the script is allowed to use.

answered Sep 20 '22 by Steve Weston


Author of the parallelly package here: The parallelly::availableCores() function acknowledges various HPC environment variables (e.g. NSLOTS, PBS_NUM_PPN, and SLURM_CPUS_PER_TASK) as well as system and R settings that are used to specify the number of cores available to the process; if none of these are set, it falls back to parallel::detectCores(). As I, or others, become aware of more settings, I'll be happy to add automatic support for those too; there is an always-open GitHub issue for this at https://github.com/HenrikBengtsson/parallelly/issues/17 (there are some open requests for help).
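A minimal usage sketch, assuming parallelly is installed:

library(parallelly)

workers <- availableCores()  # honors scheduler limits; falls back sensibly
cl <- parallel::makeCluster(workers)
parallel::parLapply(cl, seq_len(workers), function(i) i)
parallel::stopCluster(cl)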

Also, if the sysadmin sets the environment variable R_PARALLELLY_AVAILABLECORES_FALLBACK=1 sitewide, then parallelly::availableCores() will return 1, unless explicitly overridden by other means (by the job scheduler, by user settings, ...). This further protects against software tools taking over all cores by default.
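For illustration, assuming that sitewide fallback is in place and nothing else specifies a core count:

# A sysadmin would typically set the fallback in the site-wide Renviron
# file (exact path varies by installation), e.g.:
#   R_PARALLELLY_AVAILABLECORES_FALLBACK=1
# Then, with no scheduler or user setting in effect:
parallelly::availableCores()  # returns 1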

In other words, if you use parallelly::availableCores() rather than parallel::detectCores() you can be fairly sure that your code plays nice in multi-tenant environments (if it turns out it's not enough, please let us know in the above GitHub issue) and that any end user can still control the number of cores without you having to change your code.

EDIT 2021-07-26: availableCores() was moved from the future package to parallelly in October 2020. For now, and for backward-compatibility reasons, availableCores() is re-exported by the 'future' package.

answered Sep 18 '22 by HenrikB