 

Parallel predict

I am trying to run predict() in parallel on my Windows machine. This works on smaller datasets, but it does not scale well, because a new copy of the data frame is created for each process. Is there a way to run it in parallel without making temporary copies?

My code (only a few modifications of this original code):

library(foreach)
library(doSNOW)

fit <- lm(Employed ~ ., data = longley)
scale <- 100
longley2 <- longley[rep(seq(nrow(longley)), scale), ]

num_splits <- 4
cl <- makeCluster(num_splits)
registerDoSNOW(cl)

# assign each row of longley2 to one of num_splits equally sized groups
split_testing <- sort(rank(1:nrow(longley2)) %% num_splits)

predictions <- foreach(i = unique(split_testing),
                       .combine = c, .packages = c("stats")) %dopar% {
  predict(fit, newdata = longley2[split_testing == i, ])
}
stopCluster(cl)

I am using simple data replication to test it. With scale at 10 or 1000 it works, but I would like to run it with scale <- 1000000, i.e. a data frame with 16M rows (1.86 GB, as indicated by object_size() from pryr). Note that I can also use a Linux machine if that is the only option.
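For reference, the split vector in the code above just assigns consecutive blocks of rows to workers. On a toy size (a hypothetical n = 8 with 4 splits, not the real data) it looks like this:

```r
n <- 8
num_splits <- 4

# rank(1:n) is just 1..n here; %% maps rows to workers, sort groups them into blocks
split_testing <- sort(rank(1:n) %% num_splits)
split_testing
# 0 0 1 1 2 2 3 3  -> two consecutive rows per worker
```

Each distinct value marks the rows one worker receives, which is why `longley2[split_testing == i, ]` selects that worker's block.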

asked Feb 24 '15 by Tomas Greif

1 Answer

You can use the isplitRows function from the itertools package to send only the fraction of longley2 that is needed for the task:

library(itertools)

predictions <-
  foreach(d=isplitRows(longley2, chunks=num_splits),
          .combine=c, .packages=c("stats")) %dopar% {
    predict(fit, newdata=d)
  }

This prevents the entire longley2 data frame from being automatically exported to each of the workers and simplifies the code a bit.
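As a standalone sketch of what isplitRows does (using a small hypothetical data frame, not the question's data): it returns an iterator whose elements are consecutive row chunks, so between them the workers still see every row exactly once.

```r
library(itertools)   # provides isplitRows
library(iterators)   # provides nextElem

df <- data.frame(x = 1:10)

# split the 10 rows into 3 consecutive chunks
it <- isplitRows(df, chunks = 3)

total <- 0
repeat {
  # nextElem signals the end of the iterator with a StopIteration error
  chunk <- tryCatch(nextElem(it), error = function(e) NULL)
  if (is.null(chunk)) break
  total <- total + nrow(chunk)
}
total  # every row appears in exactly one chunk
```

When used with foreach, the iterator is consumed on the master, so each worker is sent only its own chunk rather than the full data frame.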

answered Nov 15 '22 by Steve Weston