Expanding window (cumulative calculation) in data.table: how to improve performance

I have grouped data collected at different time steps. Within each time step, there are several registrations of values. Each value may occur one or more times, both within and across time steps.

Some toy data:

df <- data.frame(grp = rep(1:2, each = 8),
                 time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
                 val = c(1, 2, 1,  2, 3,  2, 3, 4,  1, 2, 3,  1, 1,  1, 2, 3))

df
#    grp time val
# 1    1    1   1
# 2    1    1   2
# 3    1    1   1
# 4    1    2   2
# 5    1    2   3
# 6    1    3   2
# 7    1    3   3
# 8    1    3   4
# 9    2    1   1
# 10   2    1   2
# 11   2    1   3
# 12   2    2   1
# 13   2    2   1
# 14   2    3   1
# 15   2    3   2
# 16   2    3   3

Objectives

I wish to do some calculations within an expanding time window, i.e. within time step 1, within time 1 and 2 together, within 1, 2, and 3 together, and so on. Within each window, I wish to calculate the number of unique values, the number of values which have occurred more than once, and the proportion of values which have occurred more than once.

For example, in my toy data, in group (grp) 1, in the second time window (time = 1 & 2 together) three unique values (val 1, 2, 3) have been registered (n_val = 3). Two of them (1, 2) occur more than once (n_re = 2), resulting in a "re_rate" of 0.67 (see below).
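
To make the arithmetic concrete, that second window for grp 1 can be computed by hand in base R (just the example above, spelled out):

v <- c(1, 2, 1, 2, 3)                 # val for grp 1, time 1 and 2 together
tab <- table(v)                       # registrations per value
length(tab)                           # n_val = 3
sum(tab > 1)                          # n_re = 2
round(sum(tab > 1) / length(tab), 2)  # re_rate = 0.67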

My data.table code produces the desired result. On a small data set it is slower than my base attempt, which I believe is fair enough, given some possible overhead in the data.table code. With a larger data set, the data.table code catches up, but is still slower. I expected (hoped) that the benefits would show up earlier.

What made me post this question is that I believe the relative performance of my code is a strong indicator of me abusing data.table (I am sure the reason is not data.table performance itself). Thus, the main objective of my question is to get some advice on how to code this in a more data.table-esque way. For example, is it possible to avoid the loop over time windows altogether by vectorizing the calculations, as shown e.g. in the nice answer by @Khashaa here? If not, are there ways to make the loop and assignment more efficient?


My data.table code:

library(data.table)

f_dt <- function(df){
  setDT(df, key = c("grp", "time", "val"))[ , {
    # (whether keyed or not only affects speed marginally)

    # unique time steps
    times <- .SD[ , unique(time)]

    # index vector to loop over
    idx <- seq_along(times)

    # pre-allocate data table
    d2 <- data.table(time = times,
                     n_val = integer(1),
                     n_re = integer(1),
                     re_rate = numeric(1))

    # loop to generate expanding window
    for(i in idx){

      # number of registrations per val
      n <- .SD[time %in% times[seq_len(i)], .(n = .N), by = val][ , n]

      # number of unique val
      set(x = d2, i = i, j = 2L, length(n))

      # number of val registered more than once
      set(x = d2, i = i, j = 3L, sum(n > 1))
    }
    # proportion values registered more than once
    d2[ , re_rate := round(n_re / n_val, 2)]
    d2
  }
  , by = grp]
}

...which gives the desired result:

f_dt(df)

#    grp time n_val n_re re_rate
# 1:   1    1     2    1    0.50
# 2:   1    2     3    2    0.67
# 3:   1    3     4    3    0.75
# 4:   2    1     3    0    0.00
# 5:   2    2     3    1    0.33
# 6:   2    3     3    3    1.00

Corresponding base code:

f_by <- function(df){
  do.call(rbind,
          by(data = df, df$grp, function(d){

            times <- unique(d$time)
            idx <- seq_along(times)
            d2 <- data.frame(grp = d$grp[1],
                             time = times,
                             n_val = integer(1),
                             n_re = integer(1),
                             re_rate = numeric(1))

            for(i in idx){

              dat <- d[d$time %in% times[seq_len(i)], ]
              tt <- table(dat$val)
              n_re <- sum(tt > 1)
              n_val <- length(tt)
              re_rate <- round(n_re / n_val, 2)

              d2[i, ] <- data.frame(d2$grp[1], time = times[i], n_val, n_re, re_rate)
            }
            d2
          })
  )
}
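
As a quick consistency check (mine, not part of the original post), the two functions can be compared on the toy data:

res_by <- f_by(df)   # run the base version first
res_dt <- f_dt(df)   # note: setDT() inside f_dt() converts df by reference
all.equal(res_by, as.data.frame(res_dt), check.attributes = FALSE)
# [1] TRUE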

Timings:

Tiny toy data from above:

library(microbenchmark)
microbenchmark(f_by(df),
               f_dt(df),
               times = 10,
               unit = "relative")

# Unit: relative
#     expr      min       lq     mean   median       uq      max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
# f_dt(df) 1.481724 1.450203 1.474037 1.452887 1.521378 1.502686    10

Some larger data:

set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
                 time = sample(1:100, 100000, replace = TRUE),
                 val = sample(1:100, 100000, replace = TRUE))

microbenchmark(f_by(df),
               f_dt(df),
               times = 10,
               unit = "relative")

# Unit: relative
#     expr      min       lq     mean   median       uq      max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10
# f_dt(df) 1.094424 1.099642 1.107821 1.096997 1.097693 1.194983    10

Granted, the data is still not large, but I would expect data.table to have caught up by now if my code were written properly. I believe this suggests there is large potential for improving my code. Any advice is highly appreciated.

asked by Henrik

2 Answers

f <- function(df){
  # note: the := steps below also add the helper columns to df by reference
  setDT(df)[, n_val := cumsum(!duplicated(val)), by = grp      # cumulative number of unique values
    ][, occ := seq_len(.N), by = .(grp, val)                   # running occurrence count of each value
    ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), by = grp  # values seen exactly once so far
    ][, n_re := n_val - occ1                                   # values seen more than once so far
    ][, re_rate := round(n_re / n_val, 2)
    ][, .(n_val = n_val[.N], n_re = n_re[.N], re_rate = re_rate[.N]), by = .(grp, time)]
}

where

  • cumsum(!duplicated(val)) counts the cumulative number of unique values seen so far, n_val,
  • occ counts the cumulative occurrences of each value (note that it is grouped by grp and val),
  • occ1 counts the number of values that have occurred exactly once so far. That count increases by 1 when occ == 1 and decreases by 1 when occ == 2; hence cumsum(occ == 1) - cumsum(occ == 2),
  • the number of values which have occurred more than once is then n_val - occ1 (see the trace below).
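
To see how these columns evolve, here is a step-by-step trace on grp 1 of the toy data (my illustration, not part of the original answer):

library(data.table)
dt <- data.table(val = c(1, 2, 1, 2, 3, 2, 3, 4))  # val of grp 1; times are 1,1,1, 2,2, 3,3,3
dt[, n_val := cumsum(!duplicated(val))]
dt[, occ := seq_len(.N), by = val]
dt[, occ1 := cumsum(occ == 1) - cumsum(occ == 2)]
dt[, n_re := n_val - occ1]
dt
#    val n_val occ occ1 n_re
# 1:   1     1   1    1    0
# 2:   2     2   1    2    0
# 3:   1     2   2    1    1
# 4:   2     2   2    0    2
# 5:   3     3   1    1    2
# 6:   2     3   3    1    2
# 7:   3     3   2    0    3
# 8:   4     4   1    1    3

Rows 3, 5, and 8 are the last rows of time 1, 2, and 3, so their n_val and n_re values (2/1, 3/2, 4/3) are exactly the desired output for grp 1.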

Speed Comparison

set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
                 time = sample(1:100, 100000, replace = TRUE),
                 val = sample(1:100, 100000, replace = TRUE))


system.time(f(df))
# user  system elapsed 
# 0.038   0.000   0.038 

system.time(f_dt(df))
# user  system elapsed 
# 16.617   0.013  16.727

system.time(f_by(df))
# user  system elapsed 
# 16.077   0.040  16.122 
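
One caveat worth noting (my observation, not part of the original answer): the cumulative counters in f() assume that, within each grp, rows are already ordered by time, as they are in the toy data. For unordered input, such as the randomly generated benchmark data, sort first so the expanding windows line up:

# my addition: order rows before calling f() on unsorted data
setDT(df, key = c("grp", "time"))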

Hope this helps.

answered by Khashaa

I was looking for a better way to code an expanding window over non-duplicated groups and came across this question.

This question, however, seems to be about an expanding window where the grouping variable (i.e. time in the question) is duplicated. Below is a solution making use of between.

# expanding window where the grouping variable (time) is duplicated
library(data.table)
setDT(df)
df[ , {
    # unique time values defining the expanding windows
    uniqt <- unique(time)

    c(list(time = uniqt),  # output time as well

      # expanding window ending at each unique time
      do.call(rbind, lapply(uniqt, function(n) {

          # tabulate the occurrences of val within the window
          x <- table(val[between(time, uniqt[1L], n)])

          # calculate desired values
          n_val <- length(x)
          n_re <- sum(x > 1)
          data.frame(n_val = n_val, n_re = n_re, re_rate = n_re / n_val)
      })))
  }, by = grp]

result:

#    grp time n_val n_re   re_rate
# 1:   1    1     2    1 0.5000000
# 2:   1    2     3    2 0.6666667
# 3:   1    3     4    3 0.7500000
# 4:   2    1     3    0 0.0000000
# 5:   2    2     3    1 0.3333333
# 6:   2    3     3    3 1.0000000

I was unable to find out in which version of data.table between was first introduced; it may have been released after this question was posted.
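
If between is not available in your version, a plain comparison is a drop-in replacement here (my suggestion, not from the original answer; it is equivalent because each window starts at the group's first time value), and round() can be added to match the asker's output:

# inside the lapply() above, these two lines change:
x <- table(val[time <= n])              # instead of between(time, uniqt[1L], n)
data.frame(n_val = n_val, n_re = n_re,
           re_rate = round(n_re / n_val, 2))  # rounded to 2 decimals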

answered by chinsoon12