I am new to multidplyr. I have a dataset similar to what this creates:
library(multidplyr)
library(tidyverse)
library(nycflights13)
f<-flights %>% group_by(month) %>% nest()
Now I'd like to run operations on each of these tibbles on different nodes:
cluster <- create_cluster(12)
f2<-partition(f,month,cluster=cluster)
Everything seems OK up to this point, but when I run:
models<-f2 %>%
do(mod=lm(arr_delay~dep_delay,data=.))
I get the following error message:
Error in checkForRemoteErrors(lapply(cl, recvResult)) :
12 nodes produced errors; first error: object 'arr_delay' not found
Now if I try
f2 %>% browser(.)
and then type .$, I do not have access to any of the columns.
Any ideas how these columns can be accessed?
This question has two parts:
1. How do you apply a function to a nested column with do?
The "proper" way to apply functions to a nested column (or "list column") is not to use do, but to use map instead. In this case, multidplyr isn't really important, since the normal dplyr code gives the same error.
f <- flights %>% group_by(month) %>% nest()
models <- f %>%
do(mod = lm(arr_delay ~ dep_delay, data = .))
Error in eval(expr, envir, enclos) : object 'arr_delay' not found
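The error makes sense once you look at what nest() leaves behind: inside do, the dot is a slice of the nested tibble, not of the original flights data. A minimal sketch to check this, assuming dplyr, tidyr and nycflights13 are installed:

```r
library(dplyr)
library(tidyr)
library(nycflights13)

f <- flights %>% group_by(month) %>% nest()

# After nest(), the outer tibble holds only the grouping column
# and a list column called `data`
names(f)

# So `arr_delay` is not a column of `f` itself ...
"arr_delay" %in% names(f)

# ... it lives inside each element of the `data` list column
"arr_delay" %in% names(f$data[[1]])
```

Because lm() looks up arr_delay in whatever is passed as data, it has to be handed an element of the data column rather than the nested tibble.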
Using map from purrr, on the other hand, works fine:
models <- f %>%
mutate(model = purrr::map(data, ~ lm(arr_delay ~ dep_delay, data = .)))
Using your multidplyr code with mutate and map also works just fine.
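For completeness, here is roughly what the multidplyr version could look like. This is a sketch that assumes the same development version of multidplyr as in the question (create_cluster() and partition(..., cluster = ...)); the later CRAN release uses new_cluster() and partition(data, cluster) instead, so adjust accordingly. It also assumes purrr is installed on the worker nodes, since the call is namespaced rather than loaded.

```r
library(multidplyr)
library(tidyverse)
library(nycflights13)

f <- flights %>% group_by(month) %>% nest()

# Legacy multidplyr API, as used in the question
cluster <- create_cluster(12)
f2 <- partition(f, month, cluster = cluster)

# Fit one model per nested tibble on the nodes, then bring
# the results back into the local session
models <- f2 %>%
  mutate(model = purrr::map(data, ~ lm(arr_delay ~ dep_delay, data = .))) %>%
  collect()
```

Calling collect() at the end is optional; without it, models stays a party_df and the fitted lm objects remain on the nodes.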
2. How can the columns of a party_df be accessed?
You can't easily do that. Remember that the data are not available in your current R session, only on the nodes. You can access the column names using this little utility function:
names.party_df <- function(x) {
fun <- function(x) names(eval(x))
multidplyr::cluster_call(x$cluster, fun, as.name(x$name))[[1]]
}
But to access the full data, you'll most likely need to collect your data again. Alternatively, in RStudio one can use View, but note that this doesn't work great on large data sets.
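As a rough illustration of the collect route, again assuming the development version of multidplyr from the question (the two-node cluster and the name f_local are just placeholders for this example):

```r
library(multidplyr)
library(tidyverse)
library(nycflights13)

# Small cluster, only for illustration
cluster <- create_cluster(2)
f2 <- flights %>%
  group_by(month) %>%
  nest() %>%
  partition(month, cluster = cluster)

# collect() pulls the partitioned data back into the local session,
# where the columns can be accessed as usual
f_local <- collect(f2)
names(f_local)
head(f_local$data[[1]]$arr_delay)
```

Keep in mind that this copies everything back from the nodes, so on large data it is worth doing the heavy work before collecting.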