I have a discrete-time stochastic model where individuals enter in state 0, can step up/down in their state value, and then exit. The numbers of individuals and time-steps are very large so I only save the times of entry, state-transitions (up, down), and exit. From these logs, I want to reconstruct the mean state value among active individuals (after entry & before exit) for all time.
Problem / Question: The solution below is not very fast (especially for large ni and tf). Can we obtain val.t / act.t more efficiently? Ideally avoiding the loop, but I'm not sure if that's possible due to the ragged nature of to.i and tx.i (each individual experiences a random number of events). I know we can speed up with parallel::, but I'm already using all CPUs with model instances running in parallel.
MWE
# config
set.seed(1)
ni = 1000 # num individuals (i)
ne = 100 # mean num state change events per i
tf = 365*100 # total timesteps
# generate event times (dummy model)
t0.i = runif(ni,1,tf/2) # model entry time per i
tf.i = t0.i + tf/2 # model exit time per i
ne.i = rpois(ni,ne) + 1 # num events per i
to.i = lapply(1:ni,function(i){ round(runif(ne.i[i],t0.i[i],tf.i[i])) }) # state up times
tx.i = lapply(1:ni,function(i){ to.i[[i]] + rpois(ne.i[i],365) }) # state down times
# mean value among active
act.t = numeric(tf) # will hold total num active
val.t = numeric(tf) # will hold total state value among active
for (i in 1:ni){ # for each i
  ti = t0.i[i]:tf.i[i] # timesteps active
  val.i = cumsum(tabulate(to.i[[i]],tf)-tabulate(tx.i[[i]],tf)) # state value
  act.t[ti] = act.t[ti] + 1 # add to active
  val.t[ti] = val.t[ti] + val.i[ti] # add to state value
}
# plot
plot(val.t/act.t,type='l') # mean value among active
lines(act.t/ni,col='red') # proportion active
Plot (FYI)
Here's a tidyverse approach that is consistent with your output and 5-10x faster for the example data size (0.15 sec vs. 1.3 sec), while providing output that scales to an arbitrarily large number of timesteps.
My approach appears to be about 5-10x faster than the OP method for the same ni, using tf = 365*100. For instance, on my computer, ni = 100000 takes 121 seconds with the OP method and 16 seconds with mine.
I share this as an example of an orthogonal approach that could surely be made more efficient with more thought and optimization. It could also presumably be made faster still using data.table, duckdb, collapse, or polars, which in many situations (especially with large numbers of grouped calculations) can be faster than base R or tidyverse approaches; a rough data.table sketch of the same idea appears after the tidyverse code below.
This approach tabulates at each change point, rather than at each of the tf timesteps for each individual, so it offers essentially unlimited time granularity while performing far fewer calculations. The OP method takes 10x the time for each 10x increase in time granularity, so in some trivial sense this approach is many orders of magnitude faster if you want very fine time granularity. Time granularity costs a lot in the OP method, but comes "for free" with this one.
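To make the granularity point concrete, here is a tiny toy illustration (not part of the actual pipeline; the numbers are made up): a single individual's trajectory can be stored as a length-tf vector, or only at its change points.
# toy example: one individual, ups at t = 3 and 5, downs at t = 4 and 9, tf = 10
ups   <- c(3, 5)
downs <- c(4, 9)
# per-timestep representation (what the OP loop builds): one value per timestep
cumsum(tabulate(ups, 10) - tabulate(downs, 10))
#> [1] 0 0 1 0 1 1 1 1 0 0
# change-point representation (what this answer builds): one row per event
ev  <- sort(c(ups, downs))
chg <- ifelse(ev %in% ups, 1, -1) # fine here because up and down times don't coincide
data.frame(time = ev, state = cumsum(chg))
#>   time state
#> 1    3     1
#> 2    4     0
#> 3    5     1
#> 4    9     0
The per-timestep vector grows with tf, while the change-point table only grows with the number of events.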
First, I put the data with the ids, ups, and downs into a data frame. Then I reshape that long, so each up or down gets a row of its own. I also add rows for each id's entry and exit. I only keep the ups/downs that occur between each id's entry and exit.
I track each id's state as the cumulative sum of its ups minus the cumulative sum of its downs. Similarly, the overall active_n is the cumulative number of entries less the cumulative number of exits. Finally, the summed val adds all the cumulative changes and subtracts each id's remaining state at its time of exit.
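Before the real pipeline, here is a small hand-worked example of that bookkeeping (hypothetical numbers, plain base R), mainly to show why subtracting each id's state at exit keeps the running val restricted to active individuals:
# id 1: enters t=1, up at t=2, exits at t=5 (still in state 1 at exit)
# id 2: enters t=3, up at t=4, down at t=6, exits at t=8
ev <- data.frame(
  id   = c(1, 1, 2, 2, 1, 2, 2),
  time = c(1, 2, 3, 4, 5, 6, 8),
  name = c("entry", "up", "entry", "up", "exit", "down", "exit"),
  chg  = c(0, 1, 0, 1, 0, -1, 0)
)
ev$state    <- ave(ev$chg, ev$id, FUN = cumsum)                        # per-id running state
ev$active_n <- cumsum(ev$name == "entry") - cumsum(ev$name == "exit")
ev$val      <- cumsum(ev$chg) - cumsum(ev$state * (ev$name == "exit"))
ev
#>   id time  name chg state active_n val
#> 1  1    1 entry   0     0        1   0
#> 2  1    2    up   1     1        1   1
#> 3  2    3 entry   0     0        2   1
#> 4  2    4    up   1     1        2   2
#> 5  1    5  exit   0     1        1   1
#> 6  2    6  down  -1     0        1   0
#> 7  2    8  exit   0     0        0   0
At t = 5, id 1 exits while still in state 1, so that 1 is removed from val even though no down event occurred; the cumsum(state * if_else(name == "exit", 1, 0)) term below does exactly this.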
library(tidyverse)
df <- data.frame(id = rep(1:ni, ne.i),
                 up = unlist(to.i),
                 down = unlist(tx.i))
df2 <- df |>
  pivot_longer(up:down, values_to = "time") |>
  mutate(chg = if_else(name == "up", 1, -1)) |>
  add_row(id = unique(df$id), name = "entry", time = t0.i, chg = 0) |>
  add_row(id = unique(df$id), name = "exit", time = tf.i, chg = 0) |>
  arrange(time) |>
  filter(time >= time[name == "entry"], .by = id) |>
  filter(time <= time[name == "exit"], .by = id) |>
  mutate(state = cumsum(chg), .by = id) |>
  mutate(active_n = cumsum(name == "entry") - cumsum(name == "exit"),
         val = cumsum(chg) - cumsum(state * if_else(name == "exit", 1, 0)),
         mean_val = val / active_n)
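As noted above, the same grouped change-point bookkeeping could presumably be written with data.table. Here is a rough, untested sketch under that assumption; it reuses ni, ne.i, to.i, tx.i, t0.i, tf.i from the question and mirrors the column names of df2:
library(data.table)
evd <- rbindlist(list(
  data.table(id = rep(seq_len(ni), ne.i), time = unlist(to.i), name = "up",    chg =  1),
  data.table(id = rep(seq_len(ni), ne.i), time = unlist(tx.i), name = "down",  chg = -1),
  data.table(id = seq_len(ni),            time = t0.i,         name = "entry", chg =  0),
  data.table(id = seq_len(ni),            time = tf.i,         name = "exit",  chg =  0)
))
evd[, `:=`(lo = time[name == "entry"], hi = time[name == "exit"]), by = id] # each id's window
evd <- evd[time >= lo & time <= hi]                                         # drop events outside it
setorder(evd, time)
evd[, state := cumsum(chg), by = id]                                        # per-id running state
evd[, `:=`(active_n = cumsum(name == "entry") - cumsum(name == "exit"),
           val      = cumsum(chg) - cumsum(state * (name == "exit")))]
evd[, mean_val := val / active_n]
I have not benchmarked this; it is only meant to show the shape such a translation might take.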
Here's a visual demonstration that the two approaches match for the example scenario in the OP:
ggplot(df2, aes(time, mean_val)) +
  geom_step() +
  geom_step(aes(y = active_n/ni), linewidth = 1.5) +
  # for comparison in white
  geom_step(data = data.frame(time = 1:tf, mean_val = val.t/act.t),
            color = "white", linewidth = 0.2) +
  geom_step(data = data.frame(time = 1:tf, mean_val = act.t/ni),
            color = "white", linewidth = 0.5)
Here's a visual demonstration that the two approaches match for ni = 100000:
df2 |>
  slice_sample(n = 50000) |> # to constrain plot time
  ggplot(aes(time, mean_val)) +
  geom_step() +
  geom_step(aes(y = active_n/ni), linewidth = 1.5) +
  # for comparison in white
  geom_step(data = data.frame(time = 1:tf, mean_val = val.t/act.t),
            color = "white", linewidth = 0.2) +
  geom_step(data = data.frame(time = 1:tf, mean_val = act.t/ni),
            color = "white", linewidth = 0.5)
Below is a fully vectorized base R approach; it provides a ~75x speedup on the example size. The idea is to process the event deltas for all individuals at once: tabulate() counts how many entries/ups and exits/downs fall on each timestep, and cumsum() of their difference gives the running totals. Capping each individual's pending down times at its exit time plus one (the pmin() below) removes whatever state the individual still holds once it exits, matching the OP loop, which only accumulates val.i while an individual is active.
# config
set.seed(1)
ni <- 1000 # num individuals (i)
ne <- 100 # mean num state change events per i
tf <- 365*100 # total timesteps
t0.i <- runif(ni,1,tf/2) # model entry time per i
tf.i <- t0.i + tf/2 # model exit time per i
ne.i <- rpois(ni,ne) + 1 # num events per i
# generate event times (dummy model)
ne.tot <- sum(ne.i)
to.i <- round(runif(ne.tot, rep.int(t0.i, ne.i), rep.int(tf.i, ne.i)))
tx.i <- pmin(to.i + rpois(ne.tot, 365), rep.int(tf.i + 1, ne.i))
act.t <- cumsum(tabulate(t0.i, tf) - tabulate(tf.i + 1, tf))
val.t <- cumsum(tabulate(to.i, tf) - tabulate(tx.i, tf))
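The role of the pmin(..., tf.i + 1) cap is easiest to see for a single individual (made-up numbers): without it, an individual that exits while still in a raised state would keep contributing to val.t after leaving.
# one individual: ups at t = 2 and 4, exit at t = 6, pending downs at t = 9 and 12
to1 <- c(2, 4); tx1 <- c(9, 12); exit1 <- 6
cumsum(tabulate(to1, 10) - tabulate(tx1, 10))                  # uncapped: state lingers past exit
#> [1] 0 1 1 2 2 2 2 2 1 1
cumsum(tabulate(to1, 10) - tabulate(pmin(tx1, exit1 + 1), 10)) # capped: drops to 0 after exit
#> [1] 0 1 1 2 2 2 0 0 0 0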
Benchmarking:
f1 <- function() {
  # config
  set.seed(1)
  ni = 1000 # num individuals (i)
  ne = 100 # mean num state change events per i
  tf = 365*100 # total timesteps
  # generate event times (dummy model)
  t0.i = runif(ni,1,tf/2) # model entry time per i
  tf.i = t0.i + tf/2 # model exit time per i
  ne.i = rpois(ni,ne) + 1 # num events per i
  to.i = lapply(1:ni,function(i){ round(runif(ne.i[i],t0.i[i],tf.i[i])) }) # state up times
  tx.i = lapply(1:ni,function(i){ to.i[[i]] + rpois(ne.i[i],365) }) # state down times
  # mean value among active
  act.t = numeric(tf) # will hold total num active
  val.t = numeric(tf) # will hold total state value among active
  for (i in 1:ni){ # for each i
    ti = t0.i[i]:tf.i[i] # timesteps active
    val.i = cumsum(tabulate(to.i[[i]],tf)-tabulate(tx.i[[i]],tf)) # state value
    act.t[ti] = act.t[ti] + 1 # add to active
    val.t[ti] = val.t[ti] + val.i[ti] # add to state value
  }
  list(act.t = act.t, val.t = val.t)
}
f2 <- function() {
  # config
  set.seed(1)
  ni <- 1000 # num individuals (i)
  ne <- 100 # mean num state change events per i
  tf <- 365*100 # total timesteps
  t0.i <- runif(ni,1,tf/2) # model entry time per i
  tf.i <- t0.i + tf/2 # model exit time per i
  ne.i <- rpois(ni,ne) + 1 # num events per i
  # generate event times (dummy model)
  ne.tot <- sum(ne.i)
  to.i <- round(runif(ne.tot, rep.int(t0.i, ne.i), rep.int(tf.i, ne.i)))
  tx.i <- pmin(to.i + rpois(ne.tot, 365), rep.int(tf.i + 1, ne.i))
  list(act.t = cumsum(tabulate(t0.i, tf) - tabulate(tf.i + 1, tf)),
       val.t = cumsum(tabulate(to.i, tf) - tabulate(tx.i, tf)))
}
Result:
microbenchmark::microbenchmark(
  f1 = f1(),
  f2 = f2(),
  times = 10,
  check = "equal",
  unit = "relative"
)
#> Unit: relative
#> expr min lq mean median uq max neval cld
#> f1 82.23186 82.00272 78.94775 91.66146 91.71763 47.53844 10 a
#> f2 1.00000 1.00000 1.00000 1.00000 1.00000 1.00000 10 b