How to parallelise an algorithm that includes a sparse matrix, in R

I have been using the library 'doParallel' in R to increase the speed of a set of functions. However, I have run into an error that I cannot solve. I believe the following code isolates the pith of the problem:

library(Matrix)
library(doParallel)

test_mat = Matrix(c(0,1,2,NA,0,0,2,NA,1,NA,1,2,2,NA,0,1,0,2,2,2,0,0,NA,NA,1,2,1,1,2,1,rep(NA,5)), ncol=7, byrow=TRUE, sparse=TRUE)

par_func <- function(mat, ncores)
{
  cl <- makePSOCKcluster(ncores)
  clusterSetRNGStream(cl) 
  registerDoParallel(cl, cores = ncores)

  df = data.frame(1:7, NA)

  temp_vec = foreach(i=iter(df, by='row'), .combine=rbind) %dopar%
  {
    i[,2] <- sum(mat[,i[,1]] == 1, na.rm = TRUE) + 1
  }
  stopCluster(cl)
  return(temp_vec)
}

par_func(mat=test_mat, ncores=5)

This produces the following error message:

Error in { : task 1 failed - "object of type 'S4' is not subsettable" 

This function works if 'mat' is of class 'matrix' rather than 'dgCMatrix', so the problem appears to be due to the subsetting of a sparse Matrix. Do I have any options to work around this problem? The matrix 'mat' can be very large and can contain many zeros, so I would like to continue to work with sparse matrices.

asked Jun 13 '14 at 13:06 by joel38237


1 Answer

The fundamental problem is that the workers haven't loaded the Matrix package, so they don't know how to subset the Matrix object "mat". You can fix that with the foreach .packages option:

temp_vec = foreach(i=iter(df, by='row'), .packages='Matrix', .combine=rbind) %dopar% {
  # snip
}
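For reference, here is a minimal sketch of the whole function with the .packages fix applied. It is not the asker's exact code: it assumes the Matrix and doParallel packages are installed, loops over column indices directly instead of iterating over the rows of a data frame, combines the results with c rather than rbind, and uses two workers; those simplifications are my own choices for illustration.

```r
library(Matrix)
library(doParallel)

# Same data as in the question: a 5 x 7 sparse matrix containing NAs.
test_mat <- Matrix(c(0, 1, 2, NA, 0, 0, 2,
                     NA, 1, NA, 1, 2, 2, NA,
                     0, 1, 0, 2, 2, 2, 0,
                     0, NA, NA, 1, 2, 1, 1,
                     2, 1, rep(NA, 5)),
                   ncol = 7, byrow = TRUE, sparse = TRUE)

par_func <- function(mat, ncores) {
  cl <- makePSOCKcluster(ncores)
  on.exit(stopCluster(cl))   # shut the workers down even if the loop fails
  registerDoParallel(cl)

  # .packages = "Matrix" makes each worker load Matrix before running the
  # loop body, so subsetting the dgCMatrix with `[` works there.
  foreach(j = seq_len(ncol(mat)), .packages = "Matrix", .combine = c) %dopar% {
    sum(mat[, j] == 1, na.rm = TRUE) + 1
  }
}

res <- par_func(mat = test_mat, ncores = 2)
print(res)
```

The result should match what a plain serial loop over the columns gives, which is an easy way to check that the workers are now handling the sparse matrix correctly.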

Note that your example fails on all platforms, but if you were to register doParallel with:

registerDoParallel(4)

then your foreach loop would have worked on Linux and Mac OS X but failed on Windows. The reason is that on Linux and Mac OS X the mclapply function would be used, whereas on Windows a cluster object would be created implicitly and the clusterApplyLB function would be used. The workers are forked by mclapply, so they inherit the parent's environment, including the loaded packages, and the foreach loop works. Workers created with makePSOCKcluster do not inherit that environment, so you have to initialize them yourself using things like the .packages option; otherwise the foreach loop fails. Ironically, because the doParallel package hides this difference in order to make things easier, it sets up a little portability trap for Windows users.
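The lack of inheritance on PSOCK workers can be checked directly. The sketch below uses only the parallel package and assumes Matrix is installed and is not attached by a site or user startup file. It also shows clusterEvalQ as another way to initialize the workers; that is an alternative I am adding for illustration, not something the explanation above prescribes.

```r
library(parallel)

cl <- makePSOCKcluster(2)

# Fresh PSOCK workers are separate R processes: they start with only the
# default packages attached, regardless of what the master has loaded.
before <- unlist(clusterEvalQ(cl, "package:Matrix" %in% search()))

# Attach Matrix on every worker explicitly.
invisible(clusterEvalQ(cl, library(Matrix)))

after <- unlist(clusterEvalQ(cl, "package:Matrix" %in% search()))

stopCluster(cl)

print(before)
print(after)
```

With foreach, the .packages option is usually the more convenient choice because it travels with the loop itself, whereas clusterEvalQ has to be called on the specific cluster object before the loop runs.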

There are other ways in which this example can be improved (as mentioned by @agstudy), but as I said, the fundamental problem is that the Matrix package isn't loaded on the workers.

answered Oct 13 '22 at 04:10 by Steve Weston