I am working with large data frames (at least 8 million rows) and want to do some basic calculations based on a couple of grouping variables and rmultinom
. As my code stands, it takes about 1 second to complete the calculation, which wouldn't be a problem, except I need to do it thousands of times, so I would really like to speed this up.
I am currently using data frames and the tidyverse
, but I am not tied to either of these. I have tried to implement this using data.table
but couldn't figure it out. Any suggestions on how I could speed things up would be much appreciated.
An example (the real data can be an order of magnitude larger or more):
library(tidyverse)
library(microbenchmark)
# create dummy data
df <- data.frame(fact = rep(letters, each = 312000),
month = rep(month.name, 26),
num = rep(sample(10000:100000, 12), 26),
prob = runif(312))
# Order by month
df <- df[order(df$month), ]
# group by two factor variables and calculate new variable
microbenchmark({
df2 <- df %>%
group_by(fact, month) %>%
mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))}, times = 10)
> Unit: milliseconds
> min lq mean median uq max neval
> 816.3126 822.4083 840.7966 834.6163 855.5139 879.9345 10
A bit too long for a comment, so I'll post it here.
Running
library(profr)
plot(profr(
df %>% group_by(fact, month) %>%
mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0))
))
I get a profiling plot in which nearly all of the time is spent inside rmultinom.
So, it looks like you really want to find a faster implementation of rmultinom
, which seems to be the bottleneck. This bottleneck is the same for both dplyr
and data.table
, which means only speeding up rmultinom
will give you substantial speed improvements.
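To see why, you can time a single rmultinom call of roughly the same shape as one group in the dummy data (312 groups of about 26,000 rows each, so a probability vector of ~26,000 entries; the draw size of 50,000 below is just one value from the question's 10000:100000 range):

```r
library(microbenchmark)

set.seed(1)
# One group's worth of probabilities, matching the shape of a
# (fact, month) group in the dummy data above
prob <- runif(26000)

# Time a single per-group rmultinom call; multiplied by 312 groups,
# this accounts for most of the ~1 second runtime
microbenchmark(
  one_group = rmultinom(1, 50000, prob = prob),
  times = 100
)
```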
You could also reduce the overhead caused by the pipe operator, in both dplyr
and data.table
syntax.
To illustrate the overhead caused by pipes:
microbenchmark(pipe = iris %>%
group_by(Species) %>%
mutate(mean = mean(Sepal.Length)),
no_pipe = mutate(group_by(iris, Species), mean = mean(Sepal.Length)),
times = 100) %>% autoplot()
Using data.table, you could do:
library(data.table)
dt <- copy(df)
setDT(dt)
dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)]
This gives you a minor speed improvement:
microbenchmark(dp = df %>%
group_by(fact, month) %>%
mutate(res = ifelse(prob > 0, c(rmultinom(1, num[1], prob = prob)), 0)),
dt = dt[, res := 0L][prob > 0, res := c(rmultinom(1, num[1], prob = prob)), by = .(fact, month)],
times = 1)
Unit: seconds
 expr      min       lq     mean   median       uq      max neval
   dp 1.356745 1.356745 1.356745 1.356745 1.356745 1.356745     1
   dt 1.063363 1.063363 1.063363 1.063363 1.063363 1.063363     1
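For completeness, here is a base-R sketch that drops the grouped-mutate machinery entirely and makes exactly one rmultinom call per group via split()/unsplit(). The scaled-down dummy data and the split()/unsplit() approach are my own assumptions for illustration, not from the benchmarks above; since rmultinom dominates the runtime, don't expect large gains from this alone.

```r
set.seed(1)
# Small dummy data in the same shape as the question's df
# (sizes are hypothetical, scaled down for illustration)
df <- data.frame(fact  = rep(letters[1:3], each = 40),
                 month = rep(month.name[1:4], 30),
                 num   = rep(sample(100:1000, 4), 30),
                 prob  = runif(12))

# One rmultinom call per (fact, month) group; unsplit() restores
# the draws to the original row order
g <- list(df$fact, df$month)
draws <- lapply(split(df[c("num", "prob")], g, drop = TRUE),
                function(d) c(rmultinom(1, d$num[1], prob = d$prob)))
df$res <- ifelse(df$prob > 0, unsplit(draws, g, drop = TRUE), 0)
```

As always, profile on your real data before committing to one approach; the relative overheads of dplyr, data.table, and base R shrink as the per-group rmultinom cost grows.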