
foreach %dopar% uses sequential worker setup with PSOCK cluster?

Tags:

r

Question

I've noticed that foreach/%dopar% sets up a cluster sequentially, not in parallel, before it executes tasks in parallel. If each worker requires a dataset and transferring that dataset to one worker takes N seconds, then foreach/%dopar% spends (number of workers) * N seconds on setup. This can be significant with many workers or a large N (large datasets to transfer).

My question is whether this is by design, or whether there is some parameter/setting that I'm missing in foreach or perhaps in cluster generation.

Setup

  • R 2.15.2
  • latest versions of foreach/parallel/doParallel as of today (1/7/2013)
  • Windows 7 x64

Example

library( foreach )
library( parallel )
library( doParallel )

# lots of data
data = eval( rnorm( 100000000 ) )

# make cluster/register - creates 6 nodes fairly quickly
cluster = makePSOCKcluster( 6 , outfile = "" )
registerDoParallel( cluster  )

# fire up Task Manager.  Observe that each node receives data sequentially.
# When last node gets data, then all nodes process at the same time
results = foreach( i = 1 : 500 )  %dopar%
{
    print( data[ i ] )
    return( data[ i ] )
}
SFun28 asked Jan 07 '13 16:01




1 Answer

Thanks to Rich at Revolution Computing for helping with this one.

clusterCall uses a for loop to send the data to each worker. Because R is not multi-threaded, the loop must run sequentially, so each worker's transfer starts only after the previous one has finished.
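One rough way to observe this, assuming only the base parallel package, is to time the same export against clusters of different sizes. The names payload and time_export are made up for this sketch, and the payload is kept far smaller than the vector in the question so that it runs quickly. If the transfer is sequential, the elapsed time should grow roughly with the number of workers, although the exact figures depend on the machine.

```r
library(parallel)

# Hypothetical helper: start n_workers PSOCK workers, time how long it
# takes to export `payload` to all of them, then shut the cluster down.
time_export <- function(n_workers, payload) {
  cl <- makePSOCKcluster(n_workers)
  on.exit(stopCluster(cl))
  # clusterExport copies the named variable to every worker
  system.time(clusterExport(cl, "payload", envir = environment()))[["elapsed"]]
}

# Stand-in payload (about 16 MB of doubles)
payload <- rnorm(2e6)

t1 <- time_export(1, payload)
t2 <- time_export(2, payload)
cat("1 worker: ", t1, "s\n2 workers:", t2, "s\n")
```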

There are a few possible solutions (each would require someone to code it up). R could call out to C/C++ to thread the worker setup. Or the workers could pull the data from a file on disk. Or the workers could all listen on the same socket, so that the master writes to the socket just once and the data is broadcast to all workers.
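The file-on-disk option can be approximated from user code without changing foreach itself. The following is a minimal sketch, not a tested recipe for the setup in the question: it assumes every worker can read the same path (true for local PSOCK workers), and the names big_data and data_path are invented for illustration. The master writes the data once, sends each worker only the short path string, and each worker then loads the file itself.

```r
library(parallel)
library(foreach)
library(doParallel)

# Small stand-in for the large vector in the question
big_data <- rnorm(1e5)

# Write the data once to a location every worker can read (assumed path)
data_path <- tempfile(fileext = ".rds")
saveRDS(big_data, data_path)

cluster <- makePSOCKcluster(2)
registerDoParallel(cluster)

# Only the path string travels over the sockets. clusterEvalQ sends the
# small call to every worker before collecting results, so the workers
# read the file at the same time.
clusterExport(cluster, "data_path", envir = environment())
invisible(clusterEvalQ(cluster, { big_data <- readRDS(data_path); NULL }))

# .noexport stops foreach from also shipping the master's copy of big_data
results <- foreach(i = 1:10, .noexport = "big_data") %dopar% {
  big_data[i]
}

stopCluster(cluster)
unlink(data_path)
```

Whether this is faster in practice depends on disk speed and on how many workers read the file at once, so it is worth timing against the default export.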

SFun28 answered Oct 18 '22 16:10