This feels like one of those things there should be a function or library for, but I've been looking through the docs and not spotting anything that does it out of the box for my use case, unless I'm misunderstanding something.
I have a dataset which looks something like this, at a row level:
| client | row_id | observed | expected |
|---|---|---|---|
| 1 | 1 | 2.3 | 3.5 |
| 1 | 2 | 10.2 | 9.8 |
| 2 | 1 | 3.4 | 10.3 |
Each client is a different size, ranging from a few hundred rows to several hundred thousand. There are about two hundred clients.
I then summarise that row-level data, counting the number of rows and totalling the 'observed' and 'expected' values by client, and creating a third "observed over expected" field. The result looks something like this:
| client | rows | tot_observed | tot_expected | o_over_e |
|---|---|---|---|---|
| 1 | 3654 | 45435.4 | 50434.3 | 0.900882 |
| 2 | 234 | 1022.6 | 949.2 | 1.077328 |
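For reference, the summarisation step is roughly this (a sketch using dplyr; the row-level frame is just called df here):
library(dplyr)

df_summary <- df |>
  group_by(client) |>
  summarise(
    rows = n(),
    tot_observed = sum(observed),
    tot_expected = sum(expected)
  ) |>
  mutate(o_over_e = tot_observed / tot_expected)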
I am looking to chart this O/E result by client, but want to add a confidence interval, basically to indicate that the result is a lot more uncertain for the small clients than the big ones.
One possible option is basically bootstrapping: take repeated samples with replacement of each client's data (with sample sizes relative to their actual row counts), and then take the 2.5th and 97.5th percentiles of the sample results. But I haven't found a function that seems to do it, and my attempts to roll my own are untenably slow, because R hates loops with a passion, heh:
set.seed(1234)
bootstrap_indexes <- NULL  # accumulator for the sampled row indices
for (bootstrap_id in 1:100) {
  print(bootstrap_id)
  for (client in 1:nrow(df_summary)) {
    client_id <- df_summary[[client, 1]]
    client_rowcount <- as.integer(df_summary[[client, 2]] / 2)
    client_bootstrap_sample_ids <- base::sample(1:client_rowcount, client_rowcount, replace = TRUE)
    client_bootstrap_sample_ids_table <- cbind(bootstrap_id = bootstrap_id, client_id = client_id, row_index = client_bootstrap_sample_ids)
    bootstrap_indexes <- rbind(bootstrap_indexes, client_bootstrap_sample_ids_table)
  }
}
This approach didn't even get as far as attempting the actual percentile calculation, because nested loops are not a thing R appreciates doing. It's probably not helped by the variable sample sizes, either. (It was originally attempting 1000 iterations instead of 100, and a sample the same size as the client rather than half, but I cut the numbers down in hopes of speeding it up a bit. It didn't help enough.)
I did find this question, which at first glance seems like it would answer mine, but since the observed and expected values in each row are paired and o_over_e is intended to be the total observed divided by the total expected, it feels like it would make more sense to take paired bootstrap samples and then calculate o_over_e from the totals for each sample. Maybe I'm wrong.
Is there some way to do this that makes sense? Or am I approaching the entire problem from the wrong direction, and is there something else I should be doing instead that would work better and not fight the entire system? For reference, the values are not particularly normally distributed; the dataset is extremely long-tailed.
First, some reproducible sample data. (You don't need this part since you have your real data.)
nclients <- 10
set.seed(1)
client_obs <- floor(runif(nclients, 300, 100000))
dat <- data.frame(client = rep(1:nclients, times = client_obs)) |>
  transform(
    row_id = as.integer(ave(client, client, FUN = seq_along)),
    observed = runif(length(client), 0, 20),
    expected = runif(length(client), 0, 20)
  )
head(dat)
# client row_id observed expected
# 1 1 1 4.119491 19.175472
# 2 1 2 3.531135 1.951867
# 3 1 3 13.740457 10.966097
# 4 1 4 7.682074 8.241934
# 5 1 5 15.396828 9.625548
# 6 1 6 9.953985 13.634540
table(dat$client)
# 1 2 3 4 5 6 7 8 9 10
# 26771 37400 57413 90848 20407 89869 94484 66181 63022 6460
Note: in all of the below, I use set.seed(.) so that you and others can reproduce these exact results. It is unlikely that you would want to use seeds in production; the most I could imagine would be setting a random seed and storing it for your records, i.e., something like "Setting seed for nested parallel simulation in R and storing the states of the random-number generator", but without the parallelization of the whole task. If you don't know that you need this, then in production just remove all of my set.seed(.) calls.
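For instance, a minimal way to "set and store" a seed for the record might look like this (the file name is just a placeholder; keep it wherever you keep run metadata):
# draw a seed at random, keep a copy of it, then use it for this run
run_seed <- sample.int(.Machine$integer.max, 1)
saveRDS(run_seed, "run_seed.rds")  # placeholder location
set.seed(run_seed)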
First, a quick demonstration on one of those clients:
dat1 <- subset(dat, client == 1)
nrow(dat1)
# [1] 26771
head(dat1, 10)
# client row_id observed expected
# 1 1 1 4.119491 19.175472
# 2 1 2 3.531135 1.951867
# 3 1 3 13.740457 10.966097
# 4 1 4 7.682074 8.241934
# 5 1 5 15.396828 9.625548
# 6 1 6 9.953985 13.634540
# 7 1 7 14.352370 18.183936
# 8 1 8 19.838122 2.844385
# 9 1 9 7.600704 6.215613
# 10 1 10 15.548904 4.060587
lapply(dat1[, c("observed", "expected")], sum)
# $observed
# [1] 267100.7
# $expected
# [1] 265534.7
set.seed(2025)
replicate(100, {
  ind <- sample(nrow(dat1), ceiling(nrow(dat1)/2), replace = TRUE)
  with(dat1, sum(observed[ind]) / sum(expected[ind]))
}) |>
  quantile(c(0.025, 0.975))
# 2.5% 97.5%
# 0.9914751 1.0204476
I think that's what you're attempting to do, where the size of the sampled data is around half of the number of observations (arbitrary, I just chose something).
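If you would rather use the conventional full-size resample (n rows out of n, instead of half), only the sample size changes; a minimal variant of the snippet above:
set.seed(2025)
replicate(100, {
  ind <- sample(nrow(dat1), nrow(dat1), replace = TRUE)
  with(dat1, sum(observed[ind]) / sum(expected[ind]))
}) |>
  quantile(c(0.025, 0.975))
With twice as many resampled rows going into each ratio, the resulting interval will typically be somewhat narrower than the half-sample version.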
Assuming that's what you want, let's repeat that over the whole collection:
set.seed(2025)
split(dat, ~ client) |>
  lapply(function(dat1) {
    quants <- replicate(100, {
      ind <- sample(nrow(dat1), ceiling(nrow(dat1)/2), replace = TRUE)
      with(dat1, sum(observed[ind]) / sum(expected[ind]))
    }) |>
      quantile(c(0.025, 0.975))
    cbind.data.frame(
      data.frame(client = dat1$client[1]),
      lapply(dat1[, c("observed", "expected")], sum),
      setNames(as.list(quants), c("q025", "q975"))
    )
  }) |>
  do.call(rbind, args = _) |>
  transform(o_over_e = observed / expected)
# client observed expected q025 q975 o_over_e
# 1 1 267100.68 265534.69 0.9914751 1.020448 1.0058975
# 2 2 375545.22 373928.03 0.9945105 1.014635 1.0043249
# 3 3 571633.90 574635.28 0.9883750 1.004189 0.9947769
# 4 4 908792.89 909724.32 0.9910776 1.007152 0.9989761
# 5 5 206508.27 204020.88 0.9961232 1.027182 1.0121918
# 6 6 897564.03 899627.80 0.9909475 1.005245 0.9977060
# 7 7 945769.87 943177.96 0.9943563 1.008679 1.0027481
# 8 8 659224.28 663016.10 0.9850789 1.001328 0.9942810
# 9 9 629712.19 630786.03 0.9887445 1.008196 0.9982976
# 10 10 65376.58 64696.61 0.9827206 1.041569 1.0105102
The split(.) |> lapply(..) is one way to do things "by group" in a frame; I could have used by() as well, and there are other options. See "Calculate the mean by group", "How to sum a variable by group", and "Apply several summary functions (sum, mean, etc.) on several variables by group in one call" for more methods.
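For illustration, the same computation using by() might look like this (an untested sketch along the same lines; treat the exact numbers it produces as illustrative):
set.seed(2025)
res <- by(dat, dat$client, function(dat1) {
  quants <- quantile(
    replicate(100, {
      ind <- sample(nrow(dat1), ceiling(nrow(dat1)/2), replace = TRUE)
      with(dat1, sum(observed[ind]) / sum(expected[ind]))
    }),
    c(0.025, 0.975)
  )
  data.frame(
    client   = dat1$client[1],
    observed = sum(dat1$observed),
    expected = sum(dat1$expected),
    q025     = quants[[1]],
    q975     = quants[[2]]
  )
})
do.call(rbind, res) |>
  transform(o_over_e = observed / expected)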
Internally, the replicate(100, ..) uses a fixed number of bootstrap replicates for every client; feel free to change that number, or to vary it by client (though I don't think that is strictly necessary for your need). Its expression samples half (or half plus one) of the client's rows and computes the ratio of the sums.
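If you did want to vary it by client, a hypothetical rule like the one below could be dropped in place of the replicate(100, ..) inside function(dat1); the 1e7 "budget" and the 100/1000 bounds are arbitrary:
# hypothetical: more replicates for smaller clients, bounded between 100 and 1000
n_reps <- max(100, min(1000, ceiling(1e7 / nrow(dat1))))
quants <- replicate(n_reps, {
  ind <- sample(nrow(dat1), ceiling(nrow(dat1)/2), replace = TRUE)
  with(dat1, sum(observed[ind]) / sum(expected[ind]))
}) |>
  quantile(c(0.025, 0.975))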
The cbind combines the components ($client[1], the sums of the columns, and the quantiles of bootstrapped ratios) into a single frame.
The last step just adds your raw observed/expected ratio of the sums.
If you are considering using the dplyr package, the above can be replicated with this:
library(dplyr)
set.seed(2025)  # same seed as above so the numbers match
dat |>
  # ordering only for reproducibility since the order of groups is not assured otherwise;
  # you do not need this `arrange()` in production
  arrange(client) |>
  summarize(
    .by = client,
    as.data.frame(as.list(quantile(replicate(100, {
      ind <- sample(n(), ceiling(n()/2), replace = TRUE)
      sum(observed[ind]) / sum(expected[ind])
    }), c(0.025, 0.975)))) |>
      setNames(c("q025", "q975")),
    across(c(observed, expected), ~ sum(.x)),
    o_over_e = observed / expected
  )
# client q025 q975 observed expected o_over_e
# 1 1 0.9914751 1.020448 267100.68 265534.69 1.0058975
# 2 2 0.9945105 1.014635 375545.22 373928.03 1.0043249
# 3 3 0.9883750 1.004189 571633.90 574635.28 0.9947769
# 4 4 0.9910776 1.007152 908792.89 909724.32 0.9989761
# 5 5 0.9961232 1.027182 206508.27 204020.88 1.0121918
# 6 6 0.9909475 1.005245 897564.03 899627.80 0.9977060
# 7 7 0.9943563 1.008679 945769.87 943177.96 1.0027481
# 8 8 0.9850789 1.001328 659224.28 663016.10 0.9942810
# 9 9 0.9887445 1.008196 629712.19 630786.03 0.9982976
# 10 10 0.9827206 1.041569 65376.58 64696.61 1.0105102
The numerical results are the same, even if the column order is not. The order of calculations within summarize matters: the bootstrap quantiles use the raw observed and expected columns, so they must be computed before across(..) replaces those columns with their sums, and o_over_e must come last because it uses those sums. If you want the columns in a different order, use select(..) or relocate(..) afterwards to get what you need.
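For instance, to mirror the column order of the base-R result (assuming the dplyr pipeline above was assigned to res_dplyr), one option would be:
# move the quantile columns after the sums
res_dplyr |>
  relocate(q025, q975, .after = expected)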
With 200 clients and observation counts ranging from 300 to 100,000, the produced data has 9,975,752 total rows, and both approaches took under 14 seconds on my machine. It might be possible to squeeze some more performance out of it; over to you whether it's worth the effort. (You might get a little faster using the data.table or collapse packages, though if you are not already familiar with them and you aren't certain you need the extra speed, it may not be worth your time right now.)
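If you do go that route, a data.table version might look something like this (an untested sketch; whether it is actually faster here is something you'd have to benchmark):
library(data.table)
set.seed(2025)
DT <- as.data.table(dat)
DT[, {
  # per-client bootstrap of the ratio of sums, same logic as above
  reps <- replicate(100, {
    ind <- sample(.N, ceiling(.N / 2), replace = TRUE)
    sum(observed[ind]) / sum(expected[ind])
  })
  qs <- quantile(reps, c(0.025, 0.975))
  .(observed = sum(observed), expected = sum(expected),
    q025 = qs[[1]], q975 = qs[[2]],
    o_over_e = sum(observed) / sum(expected))
}, by = client]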