Efficiently reconstruct state variable from event times in R

Tags: performance, r

I have a discrete-time stochastic model where individuals enter in state 0, can step up/down in their state value, and then exit. The numbers of individuals and time-steps are very large so I only save the times of entry, state-transitions (up, down), and exit. From these logs, I want to reconstruct the mean state value among active individuals (after entry & before exit) for all time.

Problem / Question: The solution below is not very fast (especially for large ni and tf). Can we obtain val.t / act.t more efficiently? Ideally I'd avoid the loop, but I'm not sure that's possible due to the ragged nature of to.i and tx.i (each individual experiences a random number of events). I know we can speed up with parallel::, but I'm already using all CPUs to run model instances in parallel.

MWE

# config
set.seed(1)
ni = 1000    # num individuals (i)
ne = 100     # mean num state change events per i
tf = 365*100 # total timesteps
# generate event times (dummy model)
t0.i = runif(ni,1,tf/2)   # model entry time per i
tf.i = t0.i + tf/2        # model exit time per i
ne.i = rpois(ni,ne) + 1   # num events per i
to.i = lapply(1:ni,function(i){ round(runif(ne.i[i],t0.i[i],tf.i[i])) }) # state up times
tx.i = lapply(1:ni,function(i){ to.i[[i]] + rpois(ne.i[i],365) })        # state down times
# mean value among active
act.t = numeric(tf) # will hold total num active
val.t = numeric(tf) # will hold total state value among active
for (i in 1:ni){        # for each i
  ti = t0.i[i]:tf.i[i]  # timesteps active
  val.i = cumsum(tabulate(to.i[[i]],tf)-tabulate(tx.i[[i]],tf)) # state value
  act.t[ti] = act.t[ti] + 1         # add to active
  val.t[ti] = val.t[ti] + val.i[ti] # add to state value
}
# plot
plot(val.t/act.t,type='l') # mean value among active
lines(act.t/ni,col='red') # proportion active

Plot (FYI): mean state value among active individuals (black) and proportion active (red) over time.

Asked by jessexknight, Sep 20 '25


2 Answers

Here's a tidyverse approach that is consistent with your output and 5-10x faster for the example data size (0.15 sec vs. 1.3 sec), while supporting arbitrarily fine time granularity.

The 5-10x speedup holds at the same ni with tf = 365*100: for instance, on my computer, ni = 100000 takes 121 seconds using the OP method and 16 seconds using mine.

I share this as an example of an orthogonal approach that could surely be made more efficient with more thought and optimization. It could also presumably be made faster still using data.table, duckdb, collapse, or polars, which in many situations (especially with large numbers of grouped calculations) can be faster than base R or tidyverse approaches; a rough data.table sketch follows the pipeline below.

This approach tabulates at each change point, rather than at every timestep for each individual, so it offers essentially unlimited time granularity while performing far fewer calculations. The OP method takes 10x the time for each 10x increase in time granularity, so in that sense this approach is many orders of magnitude faster if you want very fine granularity: granularity costs a lot in the OP method but comes "for free" with this one.


First, I put the data with the ids, ups, and downs into a data frame. Then I reshape that long, so each up or down gets a row of its own. I also add rows for each id's entry and exit. I only keep the ups/downs that occur between each id's entry and exit.

I track each id's state as the cumulative sum of its ups minus the cumulative sum of its downs. Similarly, the overall active_n is the cumulative number of entries minus the cumulative number of exits. Finally, the summed val adds up all the cumulative changes, minus the loss of each id's state at the time it exits.

library(tidyverse)

df <- data.frame(id = rep(1:ni, ne.i),
                 up = unlist(to.i),
                 down = unlist(tx.i))

df2 <- df |> 
  pivot_longer(up:down, values_to = "time") |>                # one row per up/down event
  mutate(chg = if_else(name == "up", 1, -1)) |>               # +1 for an up, -1 for a down
  add_row(id = unique(df$id), name = "entry", time = t0.i, chg = 0) |>
  add_row(id = unique(df$id), name = "exit", time = tf.i, chg = 0) |>
  arrange(time) |> 
  filter(time >= time[name == "entry"], .by = id) |>          # drop events before entry
  filter(time <= time[name == "exit"], .by = id) |>           # drop events after exit
  mutate(state = cumsum(chg), .by = id) |>                    # each id's running state
  mutate(active_n = cumsum(name == "entry") - cumsum(name == "exit"),
         val = cumsum(chg) - cumsum(state * if_else(name == "exit", 1, 0)),
         mean_val = val / active_n)
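As an aside on the data.table suggestion above, here's a rough, unbenchmarked sketch of the same change-point bookkeeping in data.table (the t_entry / t_exit helper columns are names I'm introducing; this is a hypothetical translation, not tested output):

library(data.table)

ev <- rbindlist(list(
  data.table(id = rep(1:ni, ne.i), name = "up",    time = unlist(to.i), chg =  1),
  data.table(id = rep(1:ni, ne.i), name = "down",  time = unlist(tx.i), chg = -1),
  data.table(id = 1:ni,            name = "entry", time = t0.i,         chg =  0),
  data.table(id = 1:ni,            name = "exit",  time = tf.i,         chg =  0)
))
setorder(ev, time)                                         # global time order
ev[, `:=`(t_entry = time[name == "entry"],                 # each id's entry/exit times
          t_exit  = time[name == "exit"]), by = id]
ev <- ev[time >= t_entry & time <= t_exit]                 # keep events in the active window
ev[, state := cumsum(chg), by = id]                        # per-id running state
ev[, active_n := cumsum(name == "entry") - cumsum(name == "exit")]
ev[, val := cumsum(chg) - cumsum(state * (name == "exit"))]
ev[, mean_val := val / active_n]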

Here's a visual demonstration that this pipeline's output matches the OP's for the example scenario:

ggplot(df2, aes(time, mean_val)) +
  geom_step() +
  geom_step(aes(y = active_n/ni), linewidth = 1.5) +
  
  # for comparison in white
  geom_step(data = data.frame(time = 1:tf, mean_val = val.t/act.t),
            color = "white", linewidth = 0.2) +
  geom_step(data = data.frame(time = 1:tf, mean_val = act.t/ni),
            color = "white", linewidth = 0.5)


Here's a visual demonstration that the two approaches match for ni = 100000:

df2 |>
  slice_sample(n = 50000) |>   # to constrain plot time
ggplot(aes(time, mean_val)) +
  geom_step() +
  geom_step(aes(y = active_n/ni), linewidth = 1.5) +
  
  # for comparison in white
  geom_step(data = data.frame(time = 1:tf, mean_val = val.t/act.t),
            color = "white", linewidth = 0.2) +
  geom_step(data = data.frame(time = 1:tf, mean_val = act.t/ni),
            color = "white", linewidth = 0.5)

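If you do still need values on a fixed grid of timesteps (say 1:tf, for a numeric rather than visual comparison with the OP vectors), the step function in df2 can be sampled at any resolution after the fact. A minimal sketch using findInterval (mean_val_t is a name I'm introducing here):

grid <- seq_len(tf)                          # timesteps at which to evaluate
idx  <- findInterval(grid, df2$time)         # last change point at or before each timestep
mean_val_t <- c(NA, df2$mean_val)[idx + 1]   # NA before the first event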

Answered by Jon Spring, Sep 22 '25

Below is a fully vectorized base R approach. It provides a ~75x speedup.

# config
set.seed(1)
ni <- 1000    # num individuals (i)
ne <- 100     # mean num state change events per i
tf <- 365*100 # total timesteps
t0.i <- runif(ni,1,tf/2)   # model entry time per i
tf.i <- t0.i + tf/2        # model exit time per i
ne.i <- rpois(ni,ne) + 1   # num events per i
# generate event times (dummy model)
ne.tot <- sum(ne.i) # total num events across all i
to.i <- round(runif(ne.tot, rep.int(t0.i, ne.i), rep.int(tf.i, ne.i))) # state up times (pooled vector)
tx.i <- pmin(to.i + rpois(ne.tot, 365), rep.int(tf.i + 1, ne.i))       # state down times, capped so ups stop counting after exit
act.t <- cumsum(tabulate(t0.i, tf) - tabulate(tf.i + 1, tf)) # running num active
val.t <- cumsum(tabulate(to.i, tf) - tabulate(tx.i, tf))     # running total state value among active
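The OP's plotting code then works unchanged on these vectors:

plot(val.t/act.t, type = 'l') # mean value among active
lines(act.t/ni, col = 'red')  # proportion active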

Benchmarking:

f1 <- function() {
  # config
  set.seed(1)
  ni = 1000    # num individuals (i)
  ne = 100     # mean num state change events per i
  tf = 365*100 # total timesteps
  # generate event times (dummy model)
  t0.i = runif(ni,1,tf/2)   # model entry time per i
  tf.i = t0.i + tf/2        # model exit time per i
  ne.i = rpois(ni,ne) + 1   # num events per i
  to.i = lapply(1:ni,function(i){ round(runif(ne.i[i],t0.i[i],tf.i[i])) }) # state up times
  tx.i = lapply(1:ni,function(i){ to.i[[i]] + rpois(ne.i[i],365) })        # state down times
  # mean value among active
  act.t = numeric(tf) # will hold total num active
  val.t = numeric(tf) # will hold total state value among active
  for (i in 1:ni){        # for each i
    ti = t0.i[i]:tf.i[i]  # timesteps active
    val.i = cumsum(tabulate(to.i[[i]],tf)-tabulate(tx.i[[i]],tf)) # state value
    act.t[ti] = act.t[ti] + 1         # add to active
    val.t[ti] = val.t[ti] + val.i[ti] # add to state value
  }
  list(act.t = act.t, val.t = val.t)
}

f2 <- function() {
  # config
  set.seed(1)
  ni <- 1000    # num individuals (i)
  ne <- 100     # mean num state change events per i
  tf <- 365*100 # total timesteps
  t0.i <- runif(ni,1,tf/2)   # model entry time per i
  tf.i <- t0.i + tf/2        # model exit time per i
  ne.i <- rpois(ni,ne) + 1   # num events per i
  # generate event times (dummy model)
  ne.tot <- sum(ne.i)
  to.i <- round(runif(ne.tot, rep.int(t0.i, ne.i), rep.int(tf.i, ne.i)))
  tx.i <- pmin(to.i + rpois(ne.tot, 365), rep.int(tf.i + 1, ne.i))
  list(act.t = cumsum(tabulate(t0.i, tf) - tabulate(tf.i + 1, tf)),
       val.t = cumsum(tabulate(to.i, tf) - tabulate(tx.i, tf)))
}
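
A quick sanity check that the two return the same result (the benchmark below also enforces this via check = "equal"):

all.equal(f1(), f2()) # should return TRUE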

Result:

microbenchmark::microbenchmark(
  f1 = f1(),
  f2 = f2(),
  times = 10,
  check = "equal",
  unit = "relative"
)
#> Unit: relative
#>  expr      min       lq     mean   median       uq      max neval cld
#>    f1 82.23186 82.00272 78.94775 91.66146 91.71763 47.53844    10  a 
#>    f2  1.00000  1.00000  1.00000  1.00000  1.00000  1.00000    10   b

Answered by jblood94, Sep 22 '25