
Asynchronous command dispatch in interactive R

I'm wondering if this is possible (it probably isn't) using one of the parallel processing backends in R. A few Google searches have turned up nothing.

The general problem I have at the moment:

  • I have some large objects that take about half an hour to load.
  • I want to generate a series of plots from the data (this takes a few minutes).
  • I want to do other things with the data while this happens (without changing the underlying data!).

Ideally I would be able to dispatch the command from the interactive session, and not have to wait for it to return (so I can go do other things while I wait for the plot to render). Is this possible, or is this a case of wishful thinking?

asked Sep 16 '13 02:09 by Scott Ritchie


1 Answer

To expand on Dirk's answer, I suggest that you use the "snow" API in the parallel package. The mcparallel function might seem perfect for this (if you're not on Windows), but it doesn't work well for graphics operations because of its use of fork. The problem with the "snow" API is that it doesn't officially support asynchronous operations. However, it's fairly easy to do if you don't mind cheating by using non-exported functions. If you look at the code for clusterCall, you can figure out how to submit tasks asynchronously:

> library(parallel)
> clusterCall
function (cl = NULL, fun, ...) 
{
    cl <- defaultCluster(cl)
    for (i in seq_along(cl)) sendCall(cl[[i]], fun, list(...))
    checkForRemoteErrors(lapply(cl, recvResult))
}

So you just use sendCall to submit a task, and recvResult to wait for the result. Here's an example of that using the bigmemory package, as suggested by Dirk.

You can create a "big matrix" using functions such as big.matrix or as.big.matrix. For truly large data you'll want to create it more efficiently, but here I'll just convert an ordinary matrix z using as.big.matrix (the z below is made-up example data so the snippet runs on its own; substitute your real matrix):

library(bigmemory)
s <- seq(-3, 3, length.out = 50)
z <- outer(s, s, function(x, y) exp(-(x^2 + y^2)))  # example surface for persp()
big <- as.big.matrix(z)

Now I'll create a cluster and connect each of the workers to big using describe and attach.big.matrix:

cl <- makePSOCKcluster(2)
worker.init <- function(descr) {
  library(bigmemory)
  big <<- attach.big.matrix(descr)
  X11()  # use "quartz()" on a Mac; "windows()" on Windows
  NULL
}
clusterCall(cl, worker.init, describe(big))

This opens a graphics window on each worker in addition to attaching it to the big matrix.

To call persp on the first cluster worker, we use sendCall:

parallel:::sendCall(cl[[1]], function() {persp(big[]); NULL}, list())

This returns almost immediately, although it may take a while until the plot appears. At this point, you can submit tasks to the other cluster worker, or do something else that is completely unrelated. Just make sure that you read the result before submitting another task to the same worker:

r1 <- parallel:::recvResult(cl[[1]])

Of course, this is all very error-prone and not at all pretty, but you could write some functions to make it easier. Just keep in mind that non-exported functions such as these can change with any new release of R.
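For instance, here's a sketch of such convenience wrappers. The names submitTask and collectResult are mine, not part of the parallel package, and they lean on the same non-exported internals, so the caveat about R releases applies:

```r
library(parallel)

# Hypothetical helper: send fun(...) to worker i without waiting.
# Wraps the non-exported sendCall used internally by clusterCall.
submitTask <- function(cl, i, fun, ...) {
  parallel:::sendCall(cl[[i]], fun, list(...))
  invisible(NULL)
}

# Hypothetical helper: block until worker i replies, raising any
# remote error just like clusterCall would.
collectResult <- function(cl, i) {
  parallel:::checkForRemoteErrors(list(parallel:::recvResult(cl[[i]])))[[1]]
}
```

With these, the persp example becomes submitTask(cl, 1, function() {persp(big[]); NULL}) followed later by collectResult(cl, 1), and plain (non-graphics) tasks work the same way.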

Note that it is perfectly possible and legitimate to execute a task on a specific worker or set of workers by subsetting the cluster object. For example:

clusterEvalQ(cl[1], persp(big[]))

This will send the task to the first worker while the others do nothing. But of course, this is synchronous, so you can't do anything on the other cluster workers until this task finishes. The only way that I know to send the tasks asynchronously is to cheat.

answered Sep 20 '22 11:09 by Steve Weston