I'm interested in using dplyr to construct bootstrap replications (repeated analyses in which the data are first sampled with replacement each time). Hadley Wickham has provided some code for repeating bootstrapped analyses in an efficient way:
bootstrap <- function(df, m) {
  n <- nrow(df)
  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE),
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
  df
}
library(dplyr)
mboot <- bootstrap(mtcars, 10)
# Works
mboot %>% summarise(mean(cyl))
While this function works well for summarise, it doesn't work for do when the do call returns a data.frame. (Imagine for now that the data.frame contains something useful, such as the results of the analysis we wish to bootstrap.)
bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Error: index out of bounds
with the traceback:
11: stop(list(message = "index out of bounds", call = NULL, cppstack = NULL))
10: .Call("dplyr_grouped_df_impl", PACKAGE = "dplyr", data, symbols,
drop)
9: grouped_df_impl(data, unname(vars), drop)
8: grouped_df(cbind_list(labels, out), groups)
7: label_output_dataframe(labels, out, groups(.data))
6: do.grouped_df(`bootstrap(mtcars, 3)`, data.frame(x = 1:2))
5: do(`bootstrap(mtcars, 3)`, data.frame(x = 1:2))
4: eval(expr, envir, enclos)
3: eval(e, env)
2: withVisible(eval(e, env))
1: bootstrap(mtcars, 3) %>% do(data.frame(x = 1:2))
I was able to work around this by performing two do steps and a group_by:
bootstrap(mtcars, 10) %>% do(d=data.frame(x=1:2)) %>% group_by(replicate) %>% do(.$d[[1]])
but this seems to require a lot of extra, somewhat clumsy steps (and it also produces the warning "Grouping rowwise data frame strips rowwise nature"). I'm also aware that I could replicate the data into ten replications first with something like
data.frame(boot=1:10) %>% group_by(boot) %>% do(sample_n(mtcars, nrow(mtcars), replace=TRUE))
but if the data or the number of bootstrap replicates is large, this uses far too much memory, because every replicate holds a full copy of the data.
Is there a way, perhaps by altering the bootstrap setup function, that I can perform these replicates with bootstrap(mtcars, 3) %>% do(data.frame(x = 1:2))?
I think it is a small bug in the bootstrap function. The vars attribute should match the column name of the data.frame in the labels attribute. But in the function, the vars attribute is set to boot, while the column name is replicate. So, if you make this minor change:
bootstrap <- function(df, m) {
  n <- nrow(df)
  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE),
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(replicate)) # Change
  # attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
  df
}
Then it works as expected:
bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Source: local data frame [6 x 2]
# Groups: replicate
# replicate x
# 1 1 1
# 2 1 2
# 3 2 1
# 4 2 2
# 5 3 1
# 6 3 2
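The function above relies on the internal attributes of a grouped data frame, which may differ between dplyr versions. As a rough illustration of the same idea (store only the row indices for each replicate and build one resample at a time, rather than copying the data m times), here is a minimal base R sketch. The helper names boot_indices and boot_do are hypothetical and are not part of dplyr:

```r
# Hypothetical helper (not part of dplyr): one vector of row indices,
# sampled with replacement, per bootstrap replicate.
boot_indices <- function(n, m) {
  replicate(m, sample(n, replace = TRUE), simplify = FALSE)
}

# Hypothetical helper: apply f to each resample and label the result rows
# with the replicate number. f must return a data.frame.
boot_do <- function(df, m, f) {
  idx <- boot_indices(nrow(df), m)
  out <- lapply(seq_along(idx), function(i) {
    # Only one resampled copy of the data exists at any time
    res <- f(df[idx[[i]], , drop = FALSE])
    data.frame(replicate = i, res)
  })
  do.call(rbind, out)
}

set.seed(1)
res <- boot_do(mtcars, 3, function(d) data.frame(mean_cyl = mean(d$cyl)))
res
```

This gives one row per replicate, with a replicate column playing the role of the labels attribute above. It will not be as fast as dplyr's grouped operations, but it does not depend on any particular dplyr version.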