
Counting how many times a condition is true within each group

I'm using a simulated dataset with many groups (over 2 million), where I want to count, for each group, the total number of observations and the number of observations above a threshold (here, 2).

It is much faster when I first create a flag variable: dramatically faster for dplyr and somewhat faster for data.table.

Why does this happen? How does it work in the background in each case?

Check my examples below.

Simulated dataset

# create an example dataset
set.seed(318)

N = 3000000 # number of rows

dt = data.frame(id = sample(1:5000000, N, replace = TRUE),
                value = runif(N, 0, 10))

Using dplyr

library(dplyr)

# calculate summary variables for each group
t = proc.time()
dt2 = dt %>% group_by(id) %>% summarise(N = n(),
                                        N2 = sum(value > 2))
proc.time() - t

# user  system elapsed
# 51.70    0.06   52.11


# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt2 = dt %>% mutate(flag = ifelse(value > 2, 1, 0)) %>%
  group_by(id) %>% summarise(N = n(),
                             N2 = sum(flag))
proc.time() - t

# user  system elapsed
# 3.40    0.16    3.55

Using data.table

library(data.table)

# set as data table
dt2 = setDT(dt, key = "id")


# calculate summary variables for each group
t = proc.time()
dt3 = dt2[, .(N = .N,
              N2 = sum(value > 2)), by = id]
proc.time() - t

# user  system elapsed 
# 1.93    0.00    1.94 


# calculate summary variables for each group after creating a flag variable
t = proc.time()
dt3 = dt2[, flag := ifelse(value > 2, 1, 0)][, .(N = .N,
                                                 N2 = sum(flag)), by = id]
proc.time() - t

# user  system elapsed 
# 0.33    0.04    0.39 
asked Aug 31 '16 by AntoniosK

1 Answer

The issue with dplyr is that sum is called with an expression rather than a plain column, combined with a high number of IDs/groups. From what Arun is saying in the comments, I guess the issue with data.table is similar.

Consider the code below: I reduced it to the bare minimum necessary to illustrate the problem. dplyr is slow when summing an expression, even if the expression involves only the identity function, so the performance issue has nothing to do with the greater-than comparison operator. In contrast, dplyr is fast when summing a plain vector. An even greater performance gain is achieved by reducing the number of IDs/groups from one million to ten.

The reason is that hybrid evaluation, i.e., evaluation in C++, works only if sum is used with a vector. With an expression as argument, the evaluation is done in R, which adds overhead for each group. The details are in the linked vignette. From the profile of the code, it seems that the overhead mainly comes from the tryCatch error handling function.
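To illustrate the workaround this implies (a minimal sketch, not from the original post; the data frame name and sizes here are illustrative): precomputing the comparison once with mutate leaves sum() with a plain column as its argument, which keeps it eligible for hybrid evaluation. A logical column is enough, since TRUE sums as 1.

```r
library(dplyr)

# Illustrative data: 1e5 rows, 1000 groups
df <- data.frame(id = sample(1:1000, 1e5, replace = TRUE),
                 value = runif(1e5, 0, 10))

# sum() over a plain column can be evaluated in C++ per group,
# so the comparison is done once, vectorised, instead of per group in R.
res <- df %>%
  mutate(flag = value > 2) %>%   # logical column; TRUE counts as 1
  group_by(id) %>%
  summarise(N = n(), N2 = sum(flag))
```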

##########################
### many different IDs ###
##########################

df <- data.frame(id = 1:1e6, value = runif(1e6))

# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
#    user  system elapsed
#  80.492   0.368  83.251

# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
#    user  system elapsed
#   1.264   0.004   1.279


#########################
### few different IDs ###
#########################

df$id <- rep(1:10, each = 1e5)

# sum with expression as argument
system.time(df %>% group_by(id) %>% summarise(sum(identity(value))))
#    user  system elapsed
#   0.088   0.000   0.093

# sum with vector as argument
system.time(df %>% group_by(id) %>% summarise(sum(value)))
#    user  system elapsed
#   0.072   0.004   0.077


#################
### profiling ###
#################

df <- data.frame(id = 1:1e6, value = runif(1e6))

profvis::profvis({ df %>% group_by(id) %>% summarise(sum(identity(value))) })

Code profile: [profvis screenshot]
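The same idea applies on the data.table side (a sketch under the assumption that the flag-column trick is what triggers the speed-up there too; names and sizes are illustrative): assigning the logical comparison as a column avoids ifelse() entirely and leaves sum() with a plain column, which data.table's GForce optimization can then compute per group in C.

```r
library(data.table)

# Illustrative data: 1e5 rows, 1000 groups
dt <- data.table(id = sample(1:1000, 1e5, replace = TRUE),
                 value = runif(1e5, 0, 10))

# Logical flag: computed once, vectorised; sums as 0/1
dt[, flag := value > 2]

# sum() over a plain column by group is eligible for GForce
# (run with options(datatable.verbose = TRUE) to confirm)
res <- dt[, .(N = .N, N2 = sum(flag)), by = id]
```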

answered Nov 08 '22 by bbrot