
Using dplyr's do to perform bootstrap replications

Tags: r, dplyr

I'm interested in using dplyr to construct bootstrap replications (repeated analyses in which the data are resampled with replacement each time). Hadley Wickham has provided some code for running bootstrapped analyses efficiently:

bootstrap <- function(df, m) {
  n <- nrow(df)

  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
    simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

  df
}

library(dplyr)
mboot <- bootstrap(mtcars, 10)

# Works
mboot %>% summarise(mean(cyl))

While this function works well for summarise, it doesn't work for do when the expression passed to do returns a data.frame. (Imagine for now that the data.frame contains something useful, such as the results of the analysis we wish to bootstrap.)

bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Error: index out of bounds

with the traceback

11: stop(list(message = "index out of bounds", call = NULL, cppstack = NULL))
10: .Call("dplyr_grouped_df_impl", PACKAGE = "dplyr", data, symbols, 
        drop)
9: grouped_df_impl(data, unname(vars), drop)
8: grouped_df(cbind_list(labels, out), groups)
7: label_output_dataframe(labels, out, groups(.data))
6: do.grouped_df(`bootstrap(mtcars, 3)`, data.frame(x = 1:2))
5: do(`bootstrap(mtcars, 3)`, data.frame(x = 1:2))
4: eval(expr, envir, enclos)
3: eval(e, env)
2: withVisible(eval(e, env))
1: bootstrap(mtcars, 3) %>% do(data.frame(x = 1:2))

I was able to work around this by performing two do steps and a group by:

bootstrap(mtcars, 10) %>% do(d=data.frame(x=1:2)) %>% group_by(replicate) %>% do(.$d[[1]])

but this requires several extra, somewhat clumsy steps (and also triggers the warning "Grouping rowwise data frame strips rowwise nature"). I'm also aware that I could first replicate the data into ten replicates with something like

data.frame(boot=1:10) %>% group_by(boot) %>% do(sample_n(mtcars, nrow(mtcars), replace=TRUE))

but if the data or the number of bootstrap replicates is large, this is extremely memory-inefficient.
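To make the memory concern concrete, here is a minimal base R sketch that compares storing only the resampling indices with stacking every resampled copy of the data. The names idx and full and the choice of m = 10 are illustrative assumptions, and the exact sizes will vary by platform:

```r
# Minimal sketch in base R: index-only storage vs. materialised replicates.
set.seed(1)
n <- nrow(mtcars)
m <- 10  # arbitrary number of replicates, for illustration only

# Index-only representation: m integer vectors of length n
idx <- replicate(m, sample(n, replace = TRUE), simplify = FALSE)

# Fully materialised representation: m resampled copies stacked together
full <- do.call(rbind, lapply(seq_len(m), function(i) {
  cbind(boot = i, mtcars[idx[[i]], ])
}))

print(object.size(idx))   # size of the index list only
print(object.size(full))  # size of all n * m resampled rows
```

The index list grows with n * m integers, whereas the stacked data frame grows with n * m full rows, so the gap widens as the data gets wider.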

Is there a way, perhaps by altering the bootstrap setup function, that I can perform these replicates with bootstrap(mtcars, 3) %>% do(data.frame(x = 1:2))?

David Robinson asked Sep 11 '14 17:09


1 Answer

I think this is a small bug in the bootstrap function. The vars attribute should match the column name of the data.frame stored in the labels attribute. In the function, however, the vars attribute is set to boot, while the column is named replicate. So, if you make this minor change:

bootstrap <- function(df, m) {
  n <- nrow(df)

  attr(df, "indices") <- replicate(m, sample(n, replace = TRUE), 
                                   simplify = FALSE)
  attr(df, "drop") <- TRUE
  attr(df, "group_sizes") <- rep(n, m)
  attr(df, "biggest_group_size") <- n
  attr(df, "labels") <- data.frame(replicate = 1:m)
  attr(df, "vars") <- list(quote(replicate)) # Change
#  attr(df, "vars") <- list(quote(boot)) # list(substitute(bootstrap(m)))
  class(df) <- c("grouped_df", "tbl_df", "tbl", "data.frame")

  df
}

Then it works as expected:

bootstrap(mtcars, 3) %>% do(data.frame(x=1:2))
# Source: local data frame [6 x 2]
# Groups: replicate

#   replicate x
# 1         1 1
# 2         1 2
# 3         2 1
# 4         2 2
# 5         3 1
# 6         3 2
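
As a quick sanity check on the mismatch described above, the following base R sketch compares the two attributes directly. The helper check_group_attrs is hypothetical (it is not part of dplyr), and it assumes the attribute layout used by the bootstrap function in this post, which may not match other dplyr versions:

```r
# Hypothetical helper (not part of dplyr): check that every symbol in the
# "vars" attribute names a column of the "labels" attribute.
check_group_attrs <- function(df) {
  vars <- vapply(attr(df, "vars"), as.character, character(1))
  all(vars %in% names(attr(df, "labels")))
}

# Plain data frames carrying only the two attributes of interest
bad <- mtcars
attr(bad, "labels") <- data.frame(replicate = 1:3)
attr(bad, "vars") <- list(quote(boot))       # original, mismatched name

good <- bad
attr(good, "vars") <- list(quote(replicate)) # corrected name

check_group_attrs(bad)   # FALSE
check_group_attrs(good)  # TRUE
```

The check only compares names; it does not validate the indices or group sizes.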
nograpes answered Oct 19 '22 18:10