I'm using a simulated dataset with many groups (over 2 million), and for each group I want to count the total number of observations and the number of observations above a threshold (here, 2).
Creating a flag variable first makes the aggregation much faster, especially for dplyr, and somewhat faster for data.table as well.
Why does this happen? How does it work in the background in each case?
Check my examples below.
Simulated dataset
# create an example dataset
set.seed(318)
N = 3000000 # number of rows
dt = data.frame(id = sample(1:5000000, N, replace = TRUE),
                value = runif(N, 0, 10))
Using dplyr
library(dplyr)
# calculate summary variables for each group
t = proc.time()
dt2 = dt %>% group_by(id) %>% summarise(N = n(),
                                        N2 = sum(value > 2))
proc.time() - t
# user system elapsed
# 51.70 0.06 52.11
# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt2 = dt %>% mutate(flag = ifelse(value > 2, 1, 0)) %>%
  group_by(id) %>% summarise(N = n(),
                             N2 = sum(flag))
proc.time() - t
# user system elapsed
# 3.40 0.16 3.55
Using data.table
library(data.table)
# set as data table
dt2 = setDT(dt, key = "id")
# calculate summary variables for each group
t = proc.time()
dt3 = dt2[, .(N = .N,
              N2 = sum(value > 2)), by = id]
proc.time() - t
# user system elapsed
# 1.93 0.00 1.94
# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt3 = dt2[, flag := ifelse(value > 2, 1, 0)][, .(N = .N,
                                                 N2 = sum(flag)), by = id]
proc.time() - t
# user system elapsed
# 0.33 0.04 0.39
The issue with dplyr is that sum is applied to an expression (rather than a bare column) in combination with a very large number of IDs/groups. From what Arun says in the comments, I guess the issue with data.table is similar.
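For data.table, one way to check this (a sketch of my own, not from the original post, and the exact messages may differ between data.table versions) is to run the grouping with verbose = TRUE and look for a "GForce optimized j" message. GForce is data.table's optimised C code for simple per-group aggregations such as sum() of a bare column; an expression like sum(value > 2) falls back to evaluating j in R for every group.
library(data.table)
set.seed(318)
dt = data.table(id = sample(1:50000, 300000, replace = TRUE),
                value = runif(300000, 0, 10))
dt[, flag := value > 2]
# j contains an expression (value > 2), so it is evaluated in R per group
dt[, .(N = .N, N2 = sum(value > 2)), by = id, verbose = TRUE]
# j only uses bare columns; look for "GForce optimized j" in the output
dt[, .(N = .N, N2 = sum(flag)), by = id, verbose = TRUE]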
Consider the code below: I have reduced it to the bare minimum needed to illustrate the problem. dplyr is slow when summing an expression, even if the expression is just a call to the identity function, so the performance issue has nothing to do with the greater-than comparison. In contrast, dplyr is fast when summing a plain vector. An even greater performance gain comes from reducing the number of IDs/groups from one million to ten.
The reason is that hybrid evaluation, i.e., evaluation in C++, works only if sum is used with a vector. With an expression as argument, the evaluation is done in R, which adds overhead for each group. The details are in the linked vignette. From the profile of the code, it seems that the overhead mainly comes from the tryCatch error handling function.
##########################
### many different IDs ###
##########################
df <- data.frame(id = 1:1e6, value = runif(1e6))
# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
# user system elapsed
# 80.492 0.368 83.251
# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
# user system elapsed
# 1.264 0.004 1.279
#########################
### few different IDs ###
#########################
df$id <- rep(1:10, each = 1e5)
# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
# user system elapsed
# 0.088 0.000 0.093
# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
# user system elapsed
# 0.072 0.004 0.077
#################
### profiling ###
#################
df <- data.frame(id = 1:1e6, value = runif(1e6))
profvis::profvis({ df %>% group_by(id) %>% summarise(sum(identity(value))) })
Code profile:
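As a side note (a sketch of my own, not part of the benchmarks above; timings depend on machine and dplyr version): the flag does not have to be built with ifelse(), which is itself fairly slow. Converting the comparison directly gives an integer column, and summarise() again only sees sum() of a bare column.
library(dplyr)
df <- data.frame(id = 1:1e6, value = runif(1e6, 0, 10))
system.time(
  df %>%
    mutate(flag = as.integer(value > 2)) %>%   # cheaper than ifelse(value > 2, 1, 0)
    group_by(id) %>%
    summarise(N = n(), N2 = sum(flag))         # sum() of a bare column
)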