 

Parallel R on a Windows cluster

I've got a Windows HPC Server running with some nodes in the backend. I would like to run Parallel R using multiple nodes from the backend. I think Parallel R might be using SNOW on Windows, but I'm not too sure about it. My question is: do I also need to install R on the backend nodes? Say I want to use two nodes, 32 cores per node:

cl <- makeCluster(c(rep("COMP01",32),rep("COMP02",32)),type="SOCK")

Right now, it just hangs.

What else do I need to do? Do the backend nodes need some kind of sshd running so they can communicate with each other?

asked Jul 09 '13 by Manolete

People also ask

Can R run in parallel?

Running R code in parallel can be very useful in speeding up performance. Basically, parallelization allows you to run multiple processes in your code simultaneously, rather than iterating over a list one element at a time, or running a single process at a time.
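As a minimal illustration (purely local, not tied to the cluster in the original question), the parallel package's parLapply evaluates list elements across several worker processes instead of one at a time:

library(parallel)

# Sequential: elements are processed one at a time
res_seq <- lapply(1:8, function(i) sqrt(i))

# Parallel: a PSOCK cluster of 4 local worker processes handles elements concurrently
cl <- makeCluster(4)
res_par <- parLapply(cl, 1:8, function(i) sqrt(i))
stopCluster(cl)

identical(res_seq, res_par)  # TRUE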

Does doParallel work on Windows?

The multicore functionality supports multiple workers only on those operating systems that support the fork system call; this excludes Windows. By default, doParallel uses multicore functionality on Unix-like systems and snow functionality on Windows.
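As a rough sketch of that setup on Windows (the worker count of 4 is just an example), you can create the snow-style PSOCK cluster explicitly and register it with doParallel:

library(doParallel)   # attaches foreach and parallel as dependencies

cl <- makeCluster(4)      # PSOCK workers, i.e. snow functionality, which is what Windows uses
registerDoParallel(cl)

res <- foreach(i = 1:8, .combine = c) %dopar% sqrt(i)

stopCluster(cl)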

What is parallel R?

There are various packages in R which allow parallelization. The parallel package can perform tasks in parallel by providing the ability to allocate cores to R. The workflow involves finding the number of cores on the system and allocating all of them, or a subset, to make a cluster.
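For instance, a small sketch of that workflow with the parallel package (leaving one core free is just a common convention, not a requirement):

library(parallel)

n_cores <- detectCores()          # number of logical cores on this machine
cl <- makeCluster(n_cores - 1)    # allocate all but one core to the cluster

squares <- parSapply(cl, 1:100, function(x) x^2)   # distribute a simple task

stopCluster(cl)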

Is Lapply parallel?

lapply is used to call this function four times now, instead of the single time it was called before. Each of the four invocations of lapply winds up calling kmeans, but each call to kmeans only does 25 starts instead of the full 100.
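That pattern looks roughly like the following sketch (the built-in iris data is used purely as an example): the 100 random starts are split into four parallel calls of 25 each, and the best result is kept.

library(parallel)

cl <- makeCluster(4)

# Four parallel calls, each doing 25 random starts instead of one call doing 100
results <- parLapply(cl, rep(25, 4), function(nstart)
  kmeans(iris[, 1:4], centers = 3, nstart = nstart))

# Keep the run with the lowest total within-cluster sum of squares
best <- results[[which.min(sapply(results, function(r) r$tot.withinss))]]

stopCluster(cl)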


1 Answer

Setting up snow on a Windows cluster is rather difficult. Each of the machines needs to have R and snow installed, but that's the easy part. To start a SOCK cluster, you would need an sshd daemon running on each of the worker machines, and even then you can still run into trouble, so I wouldn't recommend it unless you're good at debugging and Windows system administration.

I think your best option on a Windows cluster is to use MPI. I don't have any experience with MPI on Windows myself, but I've heard of people having success with the MPICH and DeinoMPI MPI distributions for Windows. Once MPI is installed on your cluster, you also need to install the Rmpi package from source on each of your worker machines. You would then create the cluster object using the makeMPIcluster function. It's a lot of work, but I think it's more likely to eventually work than trying to use a SOCK cluster due to the problems with ssh/sshd on Windows.
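I haven't tested this on Windows myself, but as a rough sketch (assuming MPI plus the Rmpi and snow packages are installed on every node, and using 64 workers to match your two 32-core nodes), cluster creation would look something like:

library(Rmpi)    # needs a working MPI installation on every node
library(snow)

# Launch 64 MPI workers; which hosts they land on is decided by the MPI
# runtime and your job/machinefile configuration, not by R
cl <- makeMPIcluster(64)

# Quick sanity check: which node is each worker running on?
table(unlist(clusterCall(cl, function() Sys.info()[["nodename"]])))

stopCluster(cl)
mpi.quit()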

If you're desperate to run a parallel job once or twice on a Windows cluster, you could try using manual mode. It allows you to create a SOCK cluster without ssh:

workers <- c(rep("COMP01",32), rep("COMP02",32))
cl <- makeSOCKcluster(workers, manual=TRUE)

The makeSOCKcluster function will prompt you to start each one of the workers, displaying the command to use for each. You have to manually open a command window on the specified machine and execute the specified command. It can be extremely tedious, particularly with many workers, but at least it's not complicated or tricky. It can also be very useful for debugging in combination with the outfile='' option.
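For example (an untested sketch reusing the hostnames from your question), manual mode can be combined with outfile so that each worker's output stays visible in the command window you started it from:

library(snow)

workers <- c(rep("COMP01", 32), rep("COMP02", 32))

# manual = TRUE: you start each worker by hand with the command it prints
# outfile = "": worker output is not redirected, so it shows up in each worker's window
cl <- makeSOCKcluster(workers, manual = TRUE, outfile = "")

res <- parLapply(cl, 1:64, function(i) sqrt(i))   # any parallel call works as usual

stopCluster(cl)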

answered Sep 22 '22 by Steve Weston