I'm trying to use multidplyr to speed up getting residuals from a regression fit. I've created a function that fits the regression model and returns the data with the residuals added; in addition to the data, it takes two more arguments. Here's the function:
func <- function(df, reg.mdl, mdl.fmla) {
  if (reg.mdl == "linear") {
    df$resid <- lm(formula = mdl.fmla, data = df)$residuals
  } else if (reg.mdl == "poisson") {
    df$resid <- residuals(
      object = glm(formula = mdl.fmla, data = df, family = "poisson"),
      type = "pearson"
    )
  }
  return(df)
}
Here's some example data on which I'll try my multidplyr approach:
set.seed(1)
ds <- data.frame(
  group = rep(c("a", "b", "c"), each = 100),
  sex = rep(sample(c("F", "M"), 100, replace = TRUE), 3),
  y = rpois(300, 10)
)
model.formula <- as.formula("y ~ sex")
regression.model <- "poisson"
And here's the multidplyr approach:
ds %>% partition(group) %>% cluster_library("tidyverse") %>%
cluster_assign_value("func", func) %>%
do(results = func(df=.,reg.mdl=regression.model,mdl.fmla=model.formula)) %>% collect() %>% .$results %>% bind_rows()
However, this throws the following error:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
3 nodes produced errors; first error: object 'regression.model' not found
In addition: Warning message:
group_indices_.grouped_df ignores extra arguments
So I guess the way I'm passing the arguments to func from do is wrong. Any idea what's the correct way?
The error occurs because the cluster's worker processes don't have these objects in their environments. Each worker is a separate R session, so you need to copy the variables to the workers before using them:
ds %>%
  partition(group) %>%
  cluster_library("tidyverse") %>%
  cluster_assign_value("func", func) %>%
  cluster_copy(regression.model) %>%
  cluster_copy(model.formula) %>%
  do(results = func(
    df = .,
    reg.mdl = regression.model,
    mdl.fmla = model.formula
  )) %>%
  collect() %>%
  .$results %>%
  bind_rows()
Alternatively, you can set up the cluster before the pipeline (which I prefer):
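library(parallel)  # provides makePSOCKcluster, clusterEvalQ, clusterExport, stopCluster

# Start three workers, one per group
CL <- makePSOCKcluster(3)
# Load the packages on every worker
clusterEvalQ(cl = CL, library("tidyverse"))
# Copy the function and both arguments to every worker
clusterExport(cl = CL, varlist = c("func", "regression.model", "model.formula"))

ds %>%
  partition(group, cluster = CL) %>%
  do(results = func(
    df = .,
    reg.mdl = regression.model,
    mdl.fmla = model.formula
  )) %>%
  collect() %>%
  .$results %>%
  bind_rows()

# Shut the workers down when finished
stopCluster(CL)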
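To see the underlying cause in isolation, here is a minimal sketch using only the base parallel package (no multidplyr). It assumes a two-worker PSOCK cluster and reuses the variable name regression.model purely for illustration; the names cl, before, and after are made up for this example. It checks whether the object exists on the workers before and after exporting it:

```r
library(parallel)

# Illustrative stand-in for the variable from the question.
regression.model <- "poisson"

# Two fresh worker sessions; their global environments start out empty.
cl <- makePSOCKcluster(2)

# Ask each worker whether it can see the object (one result per worker).
before <- clusterEvalQ(cl, exists("regression.model"))

# Copy the object from the current environment to every worker.
clusterExport(cl, varlist = "regression.model", envir = environment())

# Ask again after the export.
after <- clusterEvalQ(cl, exists("regression.model"))

stopCluster(cl)

print(unlist(before))
print(unlist(after))
```

The first check should report that no worker has the object and the second that all of them do, which is the same situation the "object 'regression.model' not found" error reflects.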