I am trying to run predict() in parallel on my Windows machine. This works on a smaller dataset, but it does not scale well, because a new copy of the data frame is created for each process. Is there a way to run it in parallel without making temporary copies?
My code (with only a few modifications of the original):
library(foreach)
library(doSNOW)

fit <- lm(Employed ~ ., data = longley)

# replicate the data to simulate a larger prediction set
scale <- 100
longley2 <- longley[rep(seq(nrow(longley)), scale), ]

num_splits <- 4
cl <- makeCluster(num_splits)
registerDoSNOW(cl)

# assign each row of longley2 to one of num_splits groups
split_testing <- sort(rank(1:nrow(longley2)) %% num_splits)
predictions <- foreach(i = unique(split_testing),
                       .combine = c, .packages = c("stats")) %dopar% {
  predict(fit, newdata = longley2[split_testing == i, ])
}
stopCluster(cl)
I am using simple data replication to test it. With a scale of 10 or 1000 it works, but I would like to make it run with scale <- 1000000, i.e. a data frame with 16M rows (a 1.86 GB data frame, as reported by object_size() from pryr). Note that, if necessary, I can also use a Linux machine, if that is the only option.
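As a quick way to gauge the memory cost before scaling up, base R's object.size() gives an estimate comparable to pryr::object_size() without any extra packages (a sketch; a scale of 100 is used here so the snippet runs quickly):

```r
# Estimate the replicated data frame's footprint with base R (no pryr needed);
# the scale of 100 is illustrative, not the full 1000000
scale <- 100
longley2 <- longley[rep(seq(nrow(longley)), scale), ]
print(object.size(longley2), units = "Kb")
```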
You can use the isplitRows function from the itertools package to send only the fraction of longley2 that is needed for each task:
library(itertools)

predictions <-
  foreach(d = isplitRows(longley2, chunks = num_splits),
          .combine = c, .packages = c("stats")) %dopar% {
    predict(fit, newdata = d)
  }
This prevents the entire longley2 data frame from being automatically exported to each of the workers, and it simplifies the code a bit.
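Putting the pieces together, a complete runnable version of this approach might look like the following (a minimal sketch: the scale of 100 and 4 workers are illustrative, and the foreach, doSNOW and itertools packages are assumed to be installed):

```r
library(foreach)
library(doSNOW)
library(itertools)

# Fit once, then replicate the data to simulate a larger prediction set
fit <- lm(Employed ~ ., data = longley)
longley2 <- longley[rep(seq(nrow(longley)), 100), ]  # illustrative scale

num_splits <- 4
cl <- makeCluster(num_splits)
registerDoSNOW(cl)

# isplitRows hands each task only its own chunk of rows, so the
# full longley2 data frame is never exported to the workers
predictions <- foreach(d = isplitRows(longley2, chunks = num_splits),
                       .combine = c, .packages = c("stats")) %dopar% {
  predict(fit, newdata = d)
}
stopCluster(cl)

# Sanity check: the parallel result should match a plain serial predict()
all.equal(unname(predictions), unname(predict(fit, newdata = longley2)))
```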