Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: Remove nested for loops in order to make a custom bootstrap more efficient

I am trying to gather some bootstrapped estimates for summary statistics from a dataset, but I want to resample parts of the dataset at different rates, which has led me to lean on nested for loops.

Specifically, suppose there are two groups in my dataset, and each group is further divided into test and control. Group 1 has a 75% / 25% test-control ratio, and Group 2 has a 50% / 50% test-control ratio.

I want to resample such that the dataset is the same size, but the test-control ratios are 90% / 10% for both groups... in other words, resample different subgroups at different rates, which strikes me as different from what the boot package normally does.

In my dataset, I created a group variable representing the groups, and a groupT variable representing group concatenated with test/control, e.g.:

    id     group     groupT
     1         1         1T
     2         1         1T
     3         2         2T
     4         1         1C
     5         2         2C

Here's what I am running right now, with nreps arbitrarily set to be my number of bootstrap replications:

for (j in 1:nreps){

  bootdat <- datafile[-(1:nrow(datafile)),] ## initialize empty dataset

  for (i in unique(datafile$groups)){

    tstring<-paste0(i,"T") ## e.g. 1T
    cstring<-paste0(i,"C") ## e.g. 1C

    ## Size of test group resample should be ~90% of total group size

    tsize<-round(.90*length(which(datafile$groups==i)),0)

    ## Size of control group resample should be total group size minus test group size

    csize<-length(which(datafile$groups==i))-tsize

    ## Continue building bootdat by rbinding the test and control resample

    ## before moving on to the next group
    ## Note the use of datafile$groupT==tstring to ensure I'm only sampling from test, etc.

    bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==tstring),size=tsize,
      replace=TRUE),])

    bootdat<-rbind(bootdat,datafile[sample(which(datafile$groupT==cstring),size=csize,
      replace=TRUE),])
  }

  ## Here, there is code to grab some summary statistics from bootdat
  ## and store them in statVector[j] before moving on to the next replication
}

With a dataset size of about 1 million total records, this takes 3-4 minutes per replication. I feel certain there is a better way to do this either with sapply or possibly some of the dplyr functions, but I have come up empty in my attempts so far. Any help would be appreciated!

like image 881
dw0914 Avatar asked Mar 14 '18 12:03

dw0914


2 Answers

I'd strongly encourage you to look into data.table and foreach, using keyed searches for bootstraps. It'll allow you to do a single bootstrap very rapidly, and you can run each bootstrap independently on a different core. Each bootstrap of the below takes 0.5 seconds on my machine, searching through a table of 1 million rows. Something like the following should get you started:

library(data.table)
library(foreach)
library(doMC)
registerDoMC(cores=4)

# example data
    dat <- data.table(id=1:1e6, group=sample(2, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))


# define number of bootstraps
    nBootstraps <- 1000

# define sampling fractions
    fraction_test <- 0.90
    fraction_control <- 1 - fraction_test

# get number that you want to sample from each group
    N.test <- round(fraction_test * dim(dat)[1])
    N.control <- round(fraction_control * dim(dat)[1])

# key data by id
    setkey(dat, id)

# get ID values for each combination, to be used for keyed search during bootstrapping
group1_test_ids <- dat[group==1 & test_control=="T"]$id
group1_control_ids <- dat[group==1 & test_control=="C"]$id
group2_test_ids <- dat[group==2 & test_control=="T"]$id
group2_control_ids <- dat[group==2 & test_control=="C"]$id


results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %dopar% {

    # sample each group with the defined sizes, with replacement
    g1T <- dat[.(sample(group1_test_ids, size=N.test, replace=TRUE))]
    g1C <- dat[.(sample(group1_control_ids, size=N.control, replace=TRUE))]
    g2T <- dat[.(sample(group2_test_ids, size=N.test, replace=TRUE))]
    g2C <- dat[.(sample(group2_control_ids, size=N.control, replace=TRUE))]
    dat.all <- rbindlist(list(g1T, g1C, g2T, g2C))
    dat.all[, bootstrap := n]

    # do summary stats here with dat.all, return the summary stats data.table object
        return(dat.summarized)

}

EDIT: example below includes a lookup table for each of any arbitrary number of unique groups. The IDs corresponding to each combination of group + (test OR control) can be referenced within a foreach loop for simplicity. With lower numbers for N.test and N.control (900 and 100) it spits out the results of 1000 bootstraps in

library(data.table)
library(foreach)

# example data
    dat <- data.table(id=1:1e6, group=sample(24, size=1e6, replace=TRUE), test_control=sample(c("T","C"), size=1e5, replace=TRUE))

# save vector of all group values & change group to character vector for hashed environment lookup
    all_groups <- as.character(sort(unique(dat$group)))
    dat[, group := as.character(group)]


# define number of bootstraps
    nBootstraps <- 100

# get number that you want to sample from each group
    N.test <- 900
    N.control <- 100

# key data by id
    setkey(dat, id)

# all values for group

# Set up lookup table for every combination of group + test/control
    control.ids <- new.env()
    test.ids <- new.env()

    for(i in all_groups) {
        control.ids[[i]] <- dat[group==i & test_control=="C"]$id
        test.ids[[i]] <- dat[group==i & test_control=="T"]$id
    }


results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    foreach(group.i = all_groups, .combine="rbind") %do% {

        # get IDs that correspond to this group, for both test and control
        control_id_vector <- control.ids[[group.i]]
        test_id_vector <- test.ids[[group.i]]

        # search and bind
        controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
        tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
        dat.group <- rbindlist(list(controls, tests))
        dat.group[, bootstrap := n]
        return(dat.group[])
    }
    # summarize across all groups for this bootstrap and return summary stat data.table object

}

yielding

> results
              id group test_control bootstrap
       1: 701570     1            C         1
       2: 424018     1            C         1
       3: 909932     1            C         1
       4:  15354     1            C         1
       5: 514882     1            C         1
      ---
23999996: 898651    24            T      1000
23999997: 482374    24            T      1000
23999998: 845577    24            T      1000
23999999: 862359    24            T      1000
24000000: 602078    24            T      1000

This doesn't involve any of the summary stat calculation time, but here 1000 bootstraps were pulled out on 1 core serially in

   user  system elapsed
 62.574   1.267  63.844

If you need to manually code N to be different for each group, you can do the same thing as with id lookup

# create environments
control.Ns <- new.env()
test.Ns <- new.env()

# assign size values
control.Ns[["1"]]   <- 900
   test.Ns[["1"]]   <- 100
control.Ns[["2"]]   <- 400
   test.Ns[["2"]]   <- 50
    ...             ...
control.Ns[["24"]]   <- 200
   test.Ns[["24"]]   <- 5

then change the big bootstrap loop to look up these values based on the loop's current group:

results <- foreach(n = 1:nBootstraps, .combine="rbind", .inorder=FALSE) %do% {
    foreach(group.i = all_groups, .combine="rbind") %do% {

        # get IDs that correspond to this group, for both test and control
        control_id_vector <- control.ids[[group.i]]
        test_id_vector <- test.ids[[group.i]]

        # get size values
        N.control <- control.Ns[[group.i]]
        N.test <- test.Ns[[group.i]]

        # search and bind
        controls <- dat[.(sample(control_id_vector, size=N.control, replace=TRUE))]
        tests <- dat[.(sample(test_id_vector, size=N.test, replace=TRUE))]
        dat.group <- rbindlist(list(controls, tests))
        dat.group[, bootstrap := n]
        return(dat.group[])
    }
    # summarize across all groups for this bootstrap and return summary stat data.table object

}
like image 68
caw5cv Avatar answered Sep 29 '22 14:09

caw5cv


Just like caw5cv, I recommend taking a look at data.table it is usually very efficient in solving such problems, however if you want to choose to work with dplyr then you can try doing something like this:

summary_of_boot_data <- lapply(1:nreps, 
                           function(y){
                             # get bootdata
                             bootdata <- lapply(unique(datafile$group), 
                                                function(x){
                                                  tstring<-paste0(x,"T")
                                                  cstring<-paste0(x,"C")
                                                  tsize<-round(.90*length(which(datafile$group==x)),0)
                                                  csize<-length(which(datafile$group==x))-tsize
                                                  df <-rbind(datafile[sample(which(datafile$groupT==tstring),
                                                                             size=tsize, 
                                                                             replace=TRUE),],
                                                             datafile[sample(which(datafile$groupT==cstring),
                                                                             size=csize,
                                                                             replace=TRUE),])
                                                  return(df)
                                                }) %>% do.call(rbind, .)
                             # return your summary thing for bootdata e.g. summary(bootdata)
                             summary(bootdata)

                           })
summary_of_boot_data

I tried not changing you code a lot, I just replaced the use of for with lapply

hope this helps

EDIT: Based on the comment from Hugh you might want to try using data.table::rbindlist() instead of do.call(rbind, .)

like image 36
DS_UNI Avatar answered Sep 29 '22 14:09

DS_UNI