
Group_by and mutate slow on large dataframe

I am working with large (min 8 million rows) data frames and want to do some basic calculations based on a couple of grouping variables and rmultinom. As my code stands, the calculation takes at least ~1 sec, which wouldn't be a problem, but I need to do it thousands of times, so I would really like to speed this up.

I am currently using data frames and the tidyverse, but I am not tied to either. I have tried to implement this with data.table but couldn't figure it out. Any suggestions for how I could speed things up would be much appreciated.

An example (the real data can be an order of magnitude larger or more):

library(tidyverse)
library(microbenchmark)

# create dummy data
df <- data.frame(fact = rep(letters, each = 312000), 
                 month = rep(month.name, 26), 
                 num = rep(sample(10000:100000, 12), 26), 
                 prob = runif(312))

# Order by month     
df <- df[order(df$month), ]

# group by two factor variables and calculate new variable 
microbenchmark({
  df2 <- df %>%
    group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))}, times = 10)


Unit: milliseconds
      min       lq     mean   median       uq      max neval
 816.3126 822.4083 840.7966 834.6163 855.5139 879.9345    10
asked Oct 08 '18 by flee


3 Answers

A bit too long for a comment, so I'll post it here as an answer.

Running

library(profr)
plot(profr(
  df %>% group_by(fact, month) %>%
    mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))
))

I get the following:

[profr plot: virtually all of the time is spent inside rmultinom]

So it looks like what you really want is a faster implementation of rmultinom, which is the bottleneck. The bottleneck is the same for both dplyr and data.table, which means only speeding up rmultinom itself will give you substantial improvements.
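One way to confirm this is to pull the rmultinom() calls out of the grouping machinery and time them on their own. A minimal sketch, assuming the df built in the question (the split() step and the grps name are illustrative, not from the original answer):

# Time only the per-group rmultinom() draws, without dplyr/data.table overhead
grps <- split(df[, c("num", "prob")], list(df$fact, df$month), drop = TRUE)
microbenchmark(
  rmultinom_only = lapply(grps, function(g) rmultinom(1, g$num[1], prob = g$prob)),
  times = 10
)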

answered Nov 15 '22 by coffeinjunky


You can also reduce the overhead caused by the pipe operator; this applies to both dplyr and data.table syntax.

To illustrate the overhead caused by pipes:

library(ggplot2)  # microbenchmark's autoplot() method builds a ggplot

microbenchmark(pipe = iris %>%
                 group_by(Species) %>%
                 mutate(mean = mean(Sepal.Length)),
               no_pipe = mutate(group_by(iris, Species), mean = mean(Sepal.Length)),
               times = 100) %>% autoplot()

[autoplot of the benchmark: the non-piped call runs slightly faster than the piped version]
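Applied to the question's data, the non-piped form of the dplyr call would look like this (a sketch; df is the data frame built in the question):

# Same computation as the question, without the pipe
df2 <- mutate(
  group_by(df, fact, month),
  res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)
)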

answered Nov 15 '22 by Rich Pauloo


Using data.table, you could do:

library(data.table)

dt <- copy(df)  # keep the original data.frame intact
setDT(dt)

dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)]

This gives a minor speed improvement:

microbenchmark(dp = df %>%
                 group_by(fact, month) %>%
                 mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)),
               dt = dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)],
               times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
   dp 1.356745 1.356745 1.356745 1.356745 1.356745 1.356745     1
   dt 1.063363 1.063363 1.063363 1.063363 1.063363 1.063363     1
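One further note, as a sketch rather than a measured result: copy() duplicates all 8M+ rows, so if the original data.frame is no longer needed after the conversion, setDT() can modify it in place and skip that allocation:

# Convert df in place (only if you don't need it as a data.frame afterwards)
setDT(df)
df[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)]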
answered Nov 15 '22 by Jaap