 

How to parallelize do() calls with dplyr

Tags:

r

dplyr

I'm trying to figure out how to run the dplyr::do() function in parallel. After reading some of the docs, it seems that calling dplyr::init_cluster() should be sufficient to tell do() to run in parallel. Unfortunately, this doesn't seem to be the case when I test it:

library(dplyr)
test <- data_frame(a=1:3, b=letters[c(1:2, 1)])

init_cluster()  # should set up a 2-core cluster for do()
system.time({
  test %>%
    group_by(b) %>%
    do({
      Sys.sleep(3)  # each group sleeps 3 seconds
      data_frame(c = rep(max(.$a), times = max(.$a)))
    })
})
stop_cluster()

Gives this output:

Initialising 2 core cluster.
|==========================================================================|100% ~0 s remaining
   user  system elapsed 
   0.03    0.00    6.03 

I would expect the elapsed time to be around 3 seconds if the do() call were split between the two cores. I can also confirm that it isn't by adding a print inside do(): the output appears in the main R terminal. What am I missing here?
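
The check I mean looks roughly like this (the particular print() call is just an illustration); its output shows up in the main R console, which suggests everything runs in the master process:

test %>%
  group_by(b) %>%
  do({
    print(Sys.getpid())  # process ID printed in the main console, not on a worker
    Sys.sleep(3)
    data_frame(c = rep(max(.$a), times = max(.$a)))
  })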

I'm using dplyr 0.4.2 with R 3.2.1.

Asked Jul 26 '15 by Max Gordon


2 Answers

As mentioned by @Maciej, you could try multidplyr:

## Install from github
devtools::install_github("hadley/multidplyr")

Use partition() to split your dataset across multiple cores:

library(dplyr)
library(multidplyr)
test <- data_frame(a=1:3, b=letters[c(1:2, 1)])
test1 <- partition(test, a)

This initializes a 3-core cluster (one for each value of a):

# Initialising 3 core cluster.

Then simply perform your do() call:

test1 %>%
  do({
    dplyr::data_frame(c = rep(max(.$a)), times = max(.$a))
  })

Which gives:

#Source: party_df [3 x 3]
#Groups: a
#Shards: 3 [1--1 rows]
#
#      a     c times
#  (int) (int) (int)
#1     1     1     1
#2     2     2     2
#3     3     3     3
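
If you need the result back as an ordinary local data frame, multidplyr also lets you collect() a partitioned data frame into the master session (a minimal sketch; details may differ between package versions):

test1 %>%
  do({
    dplyr::data_frame(c = max(.$a))
  }) %>%
  collect()  # gathers all shards back into a single local tbl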
Answered by Steven Beaupré


You could check out Hadley's new package, multidplyr.
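
As a rough sketch of how it would apply to the timing example in the question (assuming the github version of multidplyr shown above, where partition() creates the cluster for you), the Sys.sleep(3) groups then run on separate cores and the elapsed time should drop towards 3 seconds:

library(dplyr)
library(multidplyr)

test <- data_frame(a = 1:3, b = letters[c(1:2, 1)])

system.time({
  test %>%
    partition(b) %>%   # one shard per value of b, each handled by its own worker
    do({
      Sys.sleep(3)
      data_frame(c = rep(max(.$a), times = max(.$a)))
    }) %>%
    collect()          # bring the results back to the master session
})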

Answered by Maciej