I have grouped data collected at different time steps. Within each time step, there are several registrations of values. Each value may occur one or more times within and among time steps.
df <- data.frame(grp = rep(1:2, each = 8),
                 time = c(rep(1, 3), rep(2, 2), rep(3, 3)),
                 val = c(1, 2, 1, 2, 3, 2, 3, 4, 1, 2, 3, 1, 1, 1, 2, 3))
df
# grp time val
# 1 1 1 1
# 2 1 1 2
# 3 1 1 1
# 4 1 2 2
# 5 1 2 3
# 6 1 3 2
# 7 1 3 3
# 8 1 3 4
# 9 2 1 1
# 10 2 1 2
# 11 2 1 3
# 12 2 2 1
# 13 2 2 1
# 14 2 3 1
# 15 2 3 2
# 16 2 3 3
I wish to do some calculations within an expanding time window, i.e. within time step 1, within time 1 and 2 together, within 1, 2, and 3 together, and so on. Within each window, I wish to calculate the number of unique values, the number of values which have occurred more than once, and the proportion of values which have occurred more than once.
For example, in my toy data, in group (grp) 1, in the second time window (time = 1 & 2 together) three unique values (val 1, 2, 3) have been registered (n_val = 3). Two of them (1, 2) occur more than once (n_re = 2), resulting in a "re_rate" of 0.67 (see below).
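For concreteness, that second window can be checked by hand in base R (a quick sketch using the toy df above, separate from the solutions below):
win <- df[df$grp == 1 & df$time <= 2, ]   # rows of group 1 in time steps 1 and 2
tt <- table(win$val)                      # counts per value: 1 -> 2, 2 -> 2, 3 -> 1
n_val <- length(tt)                       # 3 unique values
n_re <- sum(tt > 1)                       # 2 values occur more than once
round(n_re / n_val, 2)                    # re_rate = 0.67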
My data.table code produces the desired result. On a small data set it is slower than my base attempt, which seems fair enough, given some possible overhead in the data.table code. On a larger data set, the data.table code catches up, but is still slower. I expected (hoped) that the benefits would show up earlier.
What made me post this question is that I believe the relative performance of my code is a strong indicator of me abusing data.table (I am sure the reason is not data.table performance itself). Thus, the main objective of my question is to get some advice on how to code this in a more data.table-esque way. For example, is it possible to avoid the loop over time windows altogether by vectorizing the calculations, as shown e.g. in the nice answer by @Khashaa here? If not, are there ways to make the loop and assignment more efficient?
data.table code:
library(data.table)

f_dt <- function(df){
  setDT(df, key = c("grp", "time", "val"))[ , {
    # key or not only affects speed marginally
    # unique time steps
    times <- .SD[ , unique(time)]
    # index vector to loop over
    idx <- seq_along(times)
    # pre-allocate data table, one row per time window
    d2 <- data.table(time = times,
                     n_val = integer(1),
                     n_re = integer(1),
                     re_rate = numeric(1))
    # loop to generate expanding window
    for(i in idx){
      # number of registrations per val
      n <- .SD[time %in% times[seq_len(i)], .(n = .N), by = val][ , n]
      # number of unique val
      set(x = d2, i = i, j = 2L, length(n))
      # number of val registered more than once
      set(x = d2, i = i, j = 3L, sum(n > 1))
    }
    # proportion of values registered more than once
    d2[ , re_rate := round(n_re / n_val, 2)]
    d2
  }, by = grp]
}
...which gives the desired result:
f_dt(df)
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.50
# 2: 1 2 3 2 0.67
# 3: 1 3 4 3 0.75
# 4: 2 1 3 0 0.00
# 5: 2 2 3 1 0.33
# 6: 2 3 3 3 1.00
base code:
f_by <- function(df){
  do.call(rbind,
          by(data = df, df$grp, function(d){
            times <- unique(d$time)
            idx <- seq_along(times)
            d2 <- data.frame(grp = d$grp[1],
                             time = times,
                             n_val = integer(1),
                             n_re = integer(1),
                             re_rate = numeric(1))
            for(i in idx){
              dat <- d[d$time %in% times[seq_len(i)], ]
              tt <- table(dat$val)
              n_re <- sum(tt > 1)
              n_val <- length(tt)
              re_rate <- round(n_re / n_val, 2)
              d2[i, ] <- data.frame(d2$grp[1], time = times[i], n_val, n_re, re_rate)
            }
            d2
          })
  )
}
Benchmark on the tiny toy data from above:
library(microbenchmark)
microbenchmark(f_by(df),
               f_dt(df),
               times = 10,
               unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.481724 1.450203 1.474037 1.452887 1.521378 1.502686 10
Some larger data:
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
                 time = sample(1:100, 100000, replace = TRUE),
                 val = sample(1:100, 100000, replace = TRUE))
microbenchmark(f_by(df),
               f_dt(df),
               times = 10,
               unit = "relative")
# Unit: relative
# expr min lq mean median uq max neval
# f_by(df) 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10
# f_dt(df) 1.094424 1.099642 1.107821 1.096997 1.097693 1.194983 10
No, the data is still not large, but I would expect data.table to catch up by now if coded properly. I believe this suggests there is a large potential for improvement in my code. Any advice is highly appreciated.
f <- function(df){
  setDT(df)[, n_val := cumsum(!duplicated(val)), grp
    ][, occ := 1:.N, .(grp, val)
    ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
    ][, n_re := n_val - occ1
    ][, re_rate := round(n_re/n_val, 2)
    ][, .(n_val = n_val[.N], n_re = n_re[.N], re_rate = re_rate[.N]), .(grp, time)]
}
where cumsum(!duplicated(val)) counts the cumulative number of unique values seen so far, n_val, and occ counts the cumulative occurrences of each value (note that it is grouped by val). occ1 then counts the number of elements in val that have occurred only once so far: the count of values occurring exactly once increases by 1 when occ == 1 and decreases by 1 when occ == 2, hence cumsum(occ == 1) - cumsum(occ == 2). n_val - occ1 then gives the number of values that have occurred more than once, n_re.
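To see how these helper columns evolve row by row, here is a small sketch (assuming data.table is loaded and df is the toy data from the question) that runs the same chain but stops before the final collapse. Like f(), it relies on the rows being ordered by time within grp; copy() keeps the caller's df free of the helper columns:
f_steps <- function(df){
  copy(setDT(df))[, n_val := cumsum(!duplicated(val)), grp
    ][, occ := 1:.N, .(grp, val)
    ][, occ1 := cumsum(occ == 1) - cumsum(occ == 2), grp
    ][, n_re := n_val - occ1]
}
f_steps(df)[grp == 1]
#    grp time val n_val occ occ1 n_re
# 1:   1    1   1     1   1    1    0
# 2:   1    1   2     2   1    2    0
# 3:   1    1   1     2   2    1    1
# 4:   1    2   2     2   2    0    2
# 5:   1    2   3     3   1    1    2
# 6:   1    3   2     3   3    1    2
# 7:   1    3   3     3   2    0    3
# 8:   1    3   4     4   1    1    3
The last row within each (grp, time) then supplies the reported n_val, n_re, and re_rate for that expanding window.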
Speed Comparison
set.seed(123)
df <- data.frame(grp = sample(1:100, 100000, replace = TRUE),
time = sample(1:100, 100000, replace = TRUE),
val = sample(1:100, 100000, replace = TRUE))
system.time(f(df))
# user system elapsed
# 0.038 0.000 0.038
system.time(f_dt(df))
# user system elapsed
# 16.617 0.013 16.727
system.time(f_by(df))
# user system elapsed
# 16.077 0.040 16.122
Hope this helps.
I was looking for a better way to code an expanding window of non-duplicated groups and came across this question. This question seems to be more about an expanding window where the group (i.e. time in the question) is duplicated. Below is a solution making use of between.
#expanding group by where groups are duplicated
library(data.table)
setDT(df)
df[ , {
  #get list of unique time groups to be used in the expanding window
  uniqt <- unique(time)
  c(list(time = uniqt), #output time as well
    #expanding window of each unique time group
    do.call(rbind, lapply(uniqt, function(n) {
      #tabulate the occurrences
      x <- table(val[between(time, uniqt[1L], n)])
      #calculate desired values
      n_val <- length(x)
      n_re <- sum(x > 1)
      data.frame(n_val = n_val, n_re = n_re, re_rate = n_re/n_val)
    })))
}, by = grp]
result:
# grp time n_val n_re re_rate
# 1: 1 1 2 1 0.5000000
# 2: 1 2 3 2 0.6666667
# 3: 1 3 4 3 0.7500000
# 4: 2 1 3 0 0.0000000
# 5: 2 2 3 1 0.3333333
# 6: 2 3 3 3 1.0000000
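Note that re_rate is not rounded here; to match the two-decimal output in the question, the last line inside lapply could round it:
data.frame(n_val = n_val, n_re = n_re, re_rate = round(n_re/n_val, 2))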
I was unable to find in which version of data.table between was first released; hence, between might have been released after this question was posted.
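As a fallback for older data.table versions, between(time, uniqt[1L], n) with the default incbounds = TRUE is equivalent to a plain comparison, so the tabulation line inside lapply can be rewritten as:
x <- table(val[time >= uniqt[1L] & time <= n])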