vectorization vs. parallelization in R

As a toy example, suppose that we have a function called 'my_func' (the code is below) that takes two parameters 'n' and 'p'. Our function, 'my_func', will generate a random matrix 'x' with 'n' rows and 'p' columns and do something expensive in both runtime and memory usage, such as computing the sum of the singular values of 'x'. (Of course, the function is a one-liner, but I am shooting for readability here.)

my_func <- function(n, p) {
  x <- replicate(p, rnorm(n))
  sum(svd(x)$d)
}

If we wish to compute 'my_func' for several values of 'n', and for each value of 'n' we have several values of 'p', then vectorizing the function and applying it to all combinations of 'n' and 'p' is straightforward:

n <- 10 * seq_len(5)
p <- 100 * seq_len(10)
grid <- expand.grid(n = n, p = p)
my_func <- Vectorize(my_func)
set.seed(42)
do.call(my_func, grid)
[1]   98.61785  195.50822  292.21575  376.79186  468.13570  145.18359
[7]  280.67456  421.03196  557.87138  687.75040  168.42994  340.42452
[13]  509.65528  683.69883  851.29063  199.08474  400.25584  595.18311
[19]  784.21508  982.34591  220.73215  448.23698  669.02622  895.34184
[25] 1105.48817  242.52422  487.56694  735.67588  976.93840 1203.25949

Notice that each call to 'my_func' can be painfully slow for large 'n' and 'p' (try n = 1000 and p = 2000 for starters).
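
As a rough illustration (the actual numbers will depend on your hardware and BLAS), a single call at that size can be timed directly:

# rough illustration only; timings depend on hardware and BLAS
system.time(my_func(1000, 2000))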

Now, in my actual application with a similarly constructed function, the number of rows in 'grid' is much larger than given here. Hence, I am trying to understand vectorizing in R a little better.

First question: In the above example, are the calls to 'my_func' performed sequentially so that the memory usage in one call is garbage collected before the next call? I use vectorization often but have never stopped to ask this question.

Second question: (This question may depend on the first) Assuming that the number of calls is large enough and that 'my_func' is slow enough, is parallelization warranted here? I am presuming yes. My real question is: is parallelization warranted here if instead 'my_func' had the same large matrix passed to it for each call? For sake of argument, assume the matrix is called 'y', has 1000 rows and 5000 columns and is calculated on-the-fly. Of course, passing the matrix 'y' to each of the parallel nodes will incur some lag.
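
To make that concrete, here is a purely hypothetical sketch of the setup I mean (the function 'my_func_y' and the way it uses 'y' are made up just for illustration; my real function is more involved):

# hypothetical sketch: the same expensive computation plus a large shared matrix 'y'
y <- replicate(5000, rnorm(1000))   # 1000 x 5000, computed on-the-fly
my_func_y <- function(n, p, y) {
  x <- replicate(p, rnorm(n))
  sum(svd(x)$d) + sum(abs(y))       # 'y' used in some cheap way, only for the example
}
# each parallel worker would need access to its own copy of 'y'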

I understand that the answer to the second question may be "It depends on..." If that is the case, please let me know, and I will try to give more detail.

Also, I appreciate any advice, feedback, or OMFG WTF N00B YOU HAVEN'T SEEN THIS OTHER OBSCURE SOMEWHAT RELEVANT DISCUSSION??!!!111oneone1

asked by ramhiser

1 Answer

The answer to the first question is pretty clearly yes: almost everything in R is serial by default, so the calls run one after the other, and what one call allocates becomes eligible for garbage collection before the next call starts. (A few things internally start to use OpenMP, but R as an engine will likely remain single-threaded.)
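
One way to convince yourself: Vectorize() is just a thin wrapper around mapply(), so the do.call() over the grid above boils down to a plain sequential loop in a single R process, roughly:

# essentially what the Vectorize()/do.call() combination does under the hood:
# one call after another, in one R process
res <- mapply(my_func, grid$n, grid$p)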

So for the second question: Yes, do try that. I don't use Vectorize() much, but I do like the *apply() family. Solve it with lapply(), then load the multicore package and use mclapply() to run it over as many cores as you have. Here is an example:

R> system.time(res <- lapply(1:nrow(grid), 
+                            function(i) my_func(grid[i,1],grid[i,2])))
   user  system elapsed 
  0.470   0.000   0.459 
R> system.time(res <- mclapply(1:nrow(grid), 
+                              function(i) my_func(grid[i,1], grid[i,2])))
   user  system elapsed 
  0.610   0.140   0.135 
R> 

Notice how the elapsed time is now about 29% (= 0.135/0.459) of the original.

From here you can generalize further to parallel execution across several machines; the Task View on High-Performance Computing with R has further pointers. R 2.14.0, due October 31, will have a new package 'parallel' which combines parts of multicore and snow.
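
As a rough sketch of that snow-style route (the four local workers are just a placeholder; they could be remote hosts instead):

library(parallel)                          # in R 2.14.0; snow offers the same calls today
cl <- makeCluster(4)                       # e.g. four local workers
clusterExport(cl, c("my_func", "grid"))    # ship the needed objects to the workers once;
                                           # a large shared 'y' would add to this transfer cost
res <- parLapply(cl, seq_len(nrow(grid)),
                 function(i) my_func(grid[i, 1], grid[i, 2]))
stopCluster(cl)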

answered by Dirk Eddelbuettel