Custom Rcpp last function is slow with dplyr group_by and summarise compared to tapply

I'm trying to get a sense of how to write Rcpp summarise functions that will be fast with dplyr. My real motivation is a function that dplyr does not seem to have an equivalent for, but for the sake of simplicity I'll use the example of taking the last element of a vector.

In the code below, I consider three different functions to get the last element of a vector and apply them using both tapply and dplyr group_by/summarise.

library(dplyr)
library(microbenchmark)
library(Rcpp)
n <- 5000
df <- data.frame(grp = factor(rep(1:n, 2)), valn = rnorm(2L*n), stringsAsFactors = FALSE)

dplyr_num_last_element <- function() df %>% group_by(grp) %>% summarise(valn = last(valn))
dplyr_num_last_element_r <- function() df %>% group_by(grp) %>% summarise(valn = last_r(valn))
dplyr_num_last_element_rcpp <- function() df %>% group_by(grp) %>% summarise(val = last_rcpp(valn))
tapply_num_last_element <- function() tapply(df$valn, df$grp, FUN = last)
tapply_num_last_element_r <- function() tapply(df$valn, df$grp, FUN = last_r)
tapply_num_last_element_rcpp <- function() tapply(df$valn, df$grp, FUN = last_rcpp)

last_r <- function(x) {
  x[1]
}

cppFunction('double last_rcpp(NumericVector x) {
             int n = x.size();
             return x[n-1];
           }')

microbenchmark(
  dplyr_num_last_element(),
  dplyr_num_last_element_r(),
  dplyr_num_last_element_rcpp(),
  tapply_num_last_element(),
  tapply_num_last_element_r(),
  tapply_num_last_element_rcpp(),
  times = 10
)

Unit: milliseconds
                           expr        min         lq       mean     median         uq       max neval
       dplyr_num_last_element()   6.895850   7.088472   8.264270   7.766421   9.089424  11.00775    10
     dplyr_num_last_element_r() 205.375404 214.481520 220.995218 220.107130 225.971179 238.62544    10
  dplyr_num_last_element_rcpp() 211.593443 216.000009 222.247786 221.984289 228.801007 230.50220    10
      tapply_num_last_element()  97.082102  99.528712 101.955668 101.717887 104.370319 109.26982    10
    tapply_num_last_element_r()   6.101055   6.550065   7.386442   7.069754   7.589164   9.98025    10
 tapply_num_last_element_rcpp()  14.173171  15.145711  16.102816  15.400562  16.053229  22.00147    10

My general questions are:

1) Why does dplyr_num_last_element_r take about 220 ms on average, while tapply_num_last_element_r takes about 7 ms?

2) Is there any way to write my own last function for use with dplyr that runs in something closer to 7 ms?

Thanks!

user2506086 asked Oct 17 '22 06:10


1 Answer

My results differ from yours. Note that I changed last_r to return the last element (the version in the question returns x[1], which is the first element) and called dplyr::last explicitly, because data.table also exports a function named last.

library(dplyr)
library(microbenchmark)
library(Rcpp)
n <- 5000
df <- data.frame(
  grp = factor(rep(1:n, 2)), 
  valn = rnorm(2L*n), 
  stringsAsFactors = FALSE
)

last_r <- function(x) {
  tail(x, 1)
}

cppFunction('double last_rcpp(NumericVector x) {
            int n = x.size();
            return x[n-1];
            }')

dplyr_num_last_element <- function() df %>% group_by(grp) %>% summarise(valn = dplyr::last(valn))
dplyr_num_last_element_r <- function() df %>% group_by(grp) %>% summarise(valn = last_r(valn))
dplyr_num_last_element_rcpp <- function() df %>% group_by(grp) %>% summarise(val = last_rcpp(valn))
tapply_num_last_element <- function() tapply(df$valn, df$grp, FUN = dplyr::last)
tapply_num_last_element_r <- function() tapply(df$valn, df$grp, FUN = last_r)
tapply_num_last_element_rcpp <- function() tapply(df$valn, df$grp, FUN = last_rcpp)


library(data.table) 
dt <- data.table(df)
DT_num_last_element_r <- function() {
  setkey(dt, grp)
  dt[, last_r(valn), grp]
}
microbenchmark(
  DT_num_last_element_r(), 
  dplyr_num_last_element(), 
  dplyr_num_last_element_r(), 
  dplyr_num_last_element_rcpp(), 
  tapply_num_last_element(), 
  tapply_num_last_element_r(), 
  tapply_num_last_element_rcpp(), 
  times = 20
) 

Benchmark:

Unit: milliseconds
                           expr        min        lq      mean    median        uq       max neval
        DT_num_last_element_r()  53.956258  55.76482  57.08700  57.33898  58.50556  59.03580    20
       dplyr_num_last_element() 224.289272 228.97531 235.87757 233.73353 237.56040 293.77219    20
     dplyr_num_last_element_r() 178.778382 182.11143 187.40303 184.34760 187.00788 246.64526    20
  dplyr_num_last_element_rcpp() 107.510245 109.64476 111.56974 112.50635 113.63999 114.92428    20
      tapply_num_last_element()  55.999728  58.68948  60.68782  59.78769  63.78408  66.06941    20
    tapply_num_last_element_r()  54.591615  57.31017  58.29962  58.16951  59.98568  63.08996    20
 tapply_num_last_element_rcpp()   9.558151  10.66994  14.76226  11.54004  12.64156  73.87743    20

These results look more consistent to me. Could you rerun your benchmark with these small changes?

This is on Windows 10, R 3.4.0 (JIT compiler enabled).
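
Because the question's last_r uses x[1] while the version above uses tail(x, 1), it may be worth confirming that the variants return the same values before comparing their timings. The following is only a minimal sketch in base R (no Rcpp, dplyr or data.table needed); the helper names first_r, last_tail_r and last_index_r are hypothetical and not part of either post, and the data frame is a tiny stand-in for the benchmark data.

```r
# Hypothetical helper names, used only for this check.
first_r      <- function(x) x[1]          # what the question's last_r computes
last_tail_r  <- function(x) tail(x, 1)    # what the answer's last_r computes
last_index_r <- function(x) x[length(x)]  # plain-indexing way to get the last element

# Small stand-in for the benchmark data: each group appears exactly twice.
set.seed(1)
n <- 5L
df <- data.frame(grp = factor(rep(seq_len(n), 2)), valn = rnorm(2L * n))

res_first <- tapply(df$valn, df$grp, FUN = first_r)
res_tail  <- tapply(df$valn, df$grp, FUN = last_tail_r)
res_index <- tapply(df$valn, df$grp, FUN = last_index_r)

# The two "last" variants agree with each other ...
stopifnot(isTRUE(all.equal(res_tail, res_index)))
# ... while the x[1] variant returns a different set of values.
stopifnot(!isTRUE(all.equal(res_first, res_tail)))
```

If the stopifnot() calls pass silently, the two last-element variants are interchangeable on this data, so any timing difference between them comes from how they are called rather than from what they compute.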

F. Privé answered Oct 30 '22 16:10