
Group_by and mutate slow on large dataframe

I am working with large (min 8 million rows) data frames and want to do some basic calculations based on a couple of grouping variables and rmultinom. As my code stands, the calculation takes at least ~1 sec, which wouldn't be a problem, but I need to do it thousands of times, so I would really like to speed this up.

I am currently using data frames and the tidyverse, but I am not tied to either. I have tried to implement this with data.table but couldn't figure it out. Any suggestions for how I could speed things up would be much appreciated.

An example (the real data can be an order of magnitude larger or more):

library(tidyverse)
library(microbenchmark)

# create dummy data
df <- data.frame(fact = rep(letters, each = 312000), 
                 month = rep(month.name, 26), 
                 num = rep(sample(10000:100000, 12), 26), 
                 prob = runif(312))

# Order by month     
df <- df[order(df$month), ]

# group by two factor variables and calculate new variable 
microbenchmark({
  df2 <- df %>%
    group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))}, times = 10)


Unit: milliseconds
      min       lq     mean   median       uq      max neval
 816.3126 822.4083 840.7966 834.6163 855.5139 879.9345    10
asked Oct 08 '18 by flee


3 Answers

A bit too long for a comment, so I'll post it here as an answer.

Running

library(profr)
plot(profr(
  df %>% group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))
))

I get the following:

[profr plot: virtually all of the time is spent inside rmultinom]

So it looks like what you really want is a faster implementation of rmultinom, which is the bottleneck. The bottleneck is the same for both dplyr and data.table, which means only speeding up rmultinom itself will give you substantial improvements.
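One way to confirm this is to pull the rmultinom() calls out of the grouping machinery and time them on their own. A minimal sketch, assuming the df built in the question (the split() step and the grps name are illustrative, not from the original answer):

# Time only the per-group rmultinom() draws, without dplyr/data.table overhead
grps <- split(df[, c("num", "prob")], list(df$fact, df$month), drop = TRUE)
microbenchmark(
  rmultinom_only = lapply(grps, function(g) rmultinom(1, g$num[1], prob = g$prob)),
  times = 10
)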

answered Nov 15 '22 by coffeinjunky


You can also reduce the overhead caused by the pipe operator; this applies to both dplyr and data.table syntax.

To illustrate the overhead caused by pipes:

library(ggplot2)  # microbenchmark's autoplot() method builds a ggplot

microbenchmark(pipe = iris %>%
                 group_by(Species) %>%
                 mutate(mean = mean(Sepal.Length)),
               no_pipe = mutate(group_by(iris, Species), mean = mean(Sepal.Length)),
               times = 100) %>% autoplot()

[autoplot of the benchmark: the non-piped call runs slightly faster than the piped version]
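Applied to the question's data, the non-piped form of the dplyr call would look like this (a sketch; df is the data frame built in the question):

# Same computation as the question, without the pipe
df2 <- mutate(
  group_by(df, fact, month),
  res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)
)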

answered Nov 15 '22 by Rich Pauloo


Using data.table, you could do:

library(data.table)

dt <- copy(df)  # keep the original data.frame intact
setDT(dt)

dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)]

This gives a minor speed improvement:

microbenchmark(dp = df %>%
                 group_by(fact, month) %>%
                 mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)),
               dt = dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)],
               times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
   dp 1.356745 1.356745 1.356745 1.356745 1.356745 1.356745     1
   dt 1.063363 1.063363 1.063363 1.063363 1.063363 1.063363     1
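One further note, as a sketch rather than a measured result: copy() duplicates all 8M+ rows, so if the original data.frame is no longer needed after the conversion, setDT() can modify it in place and skip that allocation:

# Convert df in place (only if you don't need it as a data.frame afterwards)
setDT(df)
df[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)]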
answered Nov 15 '22 by Jaap