
Using packages with multi-threading in R

I need to multi-thread my R application, as it takes 5 minutes to run and uses only 15% of the computer's available CPU.

An example of a process which takes a while to run is calculating the mean of a very large raster stack containing n layers:

layer_mean = cellStats(raster_layers[[n]], stat='mean', na.rm=TRUE)

Using the parallel library, I can create a new cluster and pass a function to it:

cl <- makeCluster(8, type = "SOCK")
parLapply(cl, raster_layers[[1]], mean_function)
stopCluster(cl)

where mean_function is:

mean_function <- function(raster_object)
{
  result = cellStats(raster_object, stat='mean', na.rm=TRUE)
  return(result)
}

This method works except that the worker processes can't see the 'raster' package, which is required to use cellStats, so the call fails with an error saying there is no function cellStats. I have tried loading the library inside the function, but this doesn't help.
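A quick way to confirm this kind of problem is to ask each worker which packages it has attached. The sketch below assumes a local cluster of two workers and uses the `tools` package, which ships with R, purely as a stand-in for raster; the variable names are illustrative.

```r
library(parallel)

# Attach a package in the main session only.
# 'tools' is a stand-in for raster here (an assumption for illustration).
library(tools)

# Start two local worker processes; each one is a separate, fresh R session.
cl <- makeCluster(2)

# Ask every worker for its search path (the list of attached packages).
worker_paths <- clusterEvalQ(cl, search())

# TRUE for a worker only if the package is attached there as well.
attached_on_workers <- vapply(
  worker_paths,
  function(p) "package:tools" %in% p,
  logical(1)
)
print(attached_on_workers)

stopCluster(cl)
```

If the package shows up as not attached on the workers, functions from it cannot be found there by their bare names, even though they work in the main session.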

The raster package comes with its own cluster function, and it CAN see cellStats. However, as far as I can tell, that function must be passed a single raster object and must return a raster object, which isn't flexible enough for me. I need to be able to pass a list of objects and return a numeric value, which I can do with normal clustering using the parallel library, if only the workers could see the raster package functions.

So, does anybody know how I can make a package available to the nodes when multi-threading in R? Or, perhaps, how I can return a single value from the raster cluster function?

Single Entity asked Apr 13 '15 12:04


1 Answer

The solution came from Ben Barnes, thank you.

The following code works fine:

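# Runs on each worker: mean of all cells in one raster, ignoring NA cells
mean_function <- function(variable)
{
  result = cellStats(variable, stat='mean', na.rm=TRUE)
  return(result)
}

# Start the worker processes
cl <- makeCluster(procs, type = "SOCK")
# Attach the raster package on every worker so that cellStats can be found
clusterEvalQ(cl, library(raster))
# Apply mean_function to each element of a_list; returns a list of means
result = parLapply(cl, a_list, mean_function)
# Shut the workers down
stopCluster(cl)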

Where procs is the number of worker processes you wish to use. It does not have to equal the length of the list you are passing (in this case called a_list): parLapply splits the list across the available workers, although one raster per worker is a simple way to keep every worker busy.

a_list in this case needs to be a list of rasters on which cellStats can calculate the mean. The result is a list containing one numeric mean per raster.
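As a rough end-to-end sketch of the above, assuming the raster package is installed: the small in-memory rasters, their constant cell values, and the choice of 2 workers for 5 rasters are all made up for illustration and are not taken from the original application.

```r
library(parallel)

# Worker function, defined at top level as in the answer above.
mean_function <- function(variable) {
  cellStats(variable, stat = "mean", na.rm = TRUE)
}

# Only run the demo if raster is available (assumption: it may not be installed).
if (requireNamespace("raster", quietly = TRUE)) {
  library(raster)

  # Build 5 small in-memory rasters; raster i has every cell set to i.
  a_list <- lapply(1:5, function(i) {
    r <- raster(nrows = 10, ncols = 10)
    values(r) <- rep(i, ncell(r))
    r
  })

  procs <- 2                                    # fewer workers than rasters
  cl <- makeCluster(procs)
  invisible(clusterEvalQ(cl, library(raster)))  # attach raster on each worker
  result <- parLapply(cl, a_list, mean_function)
  stopCluster(cl)

  # Flatten the list of per-raster means into a plain numeric vector.
  means <- unlist(result)
  print(means)
}
```

Writing raster::cellStats(...) inside mean_function should also work without the clusterEvalQ call, since the :: operator loads the package's namespace on the worker, provided raster is installed in the workers' library.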

Single Entity answered Oct 23 '22 04:10