 

Optimizing subsetting with data.table in a loop

I have a basic question about how to optimize the following code. This is a very abbreviated version of my code. Basically, I have a large data.table (> 50M rows) and I would like to subset the data very often (say 10,000 times) and run some function on the subset (obviously more complicated than the one shown below, i.e. I need all columns of the subset and the function returns a new data.table). I just picked the mean to keep the example simple.

library(data.table)

# a has 1M sampled letters; b recycles a permutation of 1:100000 to 1M rows
dt <- data.table(a = sample(letters, 1000000, replace = TRUE), b = sample(1:100000))

mm <- list()

# placeholder for the real, more complicated function
foo <- function(x) mean(x$b)

for (i in 1:1000) {
  mm[[i]] <- foo(dt[a %in% sample(letters, 5)])
}

Obviously this is not the fastest way to write even this minimal example (setting keys, etc.).
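For instance, I assume a keyed version of a single subset would look something like this (just a sketch, I have not benchmarked whether it actually helps here):

setkey(dt, a)                # sort and index dt by a once
dt[.(sample(letters, 5))]    # subset via keyed join instead of %in%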

What I am interested in, however, is how to optimize the for loop. I had in mind to create indices for the subsets and then use data.table's dt[,foo(.SD),by=subset_ID], but I am not sure how to do this, since I am sampling with replacement (multiple group IDs). Any ideas based on data.table would be much appreciated (e.g. how to remove the loop?).

asked Oct 17 '22 by Puki Luki
1 Answer

I had in mind to create indices for the subsets and then use data.table dt[,foo(.SD),by=subset_ID], but I am not sure how to do this, since I am sampling with replacement (multiple group IDs).

With a join, you can have overlapping groups:

# convert to numeric so the optimized grouped mean applies (see below)
dt[, b := as.numeric(b)]

# make samples: replicate() gives a 5 x 1000 matrix of letters,
# melt() reshapes it to long form (one row per sampled letter)
set.seed(1)
mDT = setDT(melt(replicate(1000, sample(letters, 5))))
setnames(mDT, c("seqi", "g", "a"))

# compute the function on each sample: join dt to mDT on a,
# then take the mean of b within each sample id g
dt[mDT, on=.(a), allow.cartesian=TRUE, .(g, b)][, .(res = mean(b)), by=g]

which gives

         g      res
   1:    1 50017.85
   2:    2 49980.03
   3:    3 50093.80
   4:    4 50087.67
   5:    5 49990.83
  ---              
 996:  996 50013.11
 997:  997 50095.43
 998:  998 49913.61
 999:  999 50058.44
1000: 1000 49909.36

To confirm it's doing the right thing, you can check, e.g.,

dt[a %in% mDT[g == 1, a], mean(b)]
# [1] 50017.85

One downside of this approach is that it creates a very large intermediate table (containing the data for all the samples), which may get you into trouble, RAM-wise.
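If RAM does become an issue, one workaround (a sketch, not benchmarked) is to process the samples in batches, so the intermediate join never materializes more than a fraction of the groups at once:

# run the join for 100 sample ids at a time and stack the results
batches <- split(1:1000, ceiling(1:1000 / 100))
res <- rbindlist(lapply(batches, function(gs)
  dt[mDT[g %in% gs], on=.(a), allow.cartesian=TRUE, .(g, b)][, .(res = mean(b)), by=g]
))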

This approach takes advantage of your function being mean: writing it explicitly in j allows data.table to apply certain optimizations (see ?GForce), which is also why I converted b to numeric.
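If you want to check whether GForce actually kicks in, you can switch on data.table's verbose output around the grouped mean; when the optimization applies, the verbose messages should say that j was optimized by GForce:

options(datatable.verbose = TRUE)
dt[mDT, on=.(a), allow.cartesian=TRUE, .(g, b)][, .(res = mean(b)), by=g]
options(datatable.verbose = FALSE)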

I agree with Rob Jensen's suggestion to pass columns to the function instead of passing a table (with the function making assumptions about what columns appear in the table), both for efficiency and clarity.
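Concretely, that suggestion amounts to something like this (a sketch; foo2 is just an illustrative name):

foo2 <- function(b) mean(b)              # operates on a column, not a table
dt[a %in% sample(letters, 5), foo2(b)]   # pass the column inside j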

In the specific case of taking the mean, you could speed this up further by summing within each letter first, I think:

# aggregate once per letter: count of rows and sum of b
dtagg = dt[, .(.N, sumb = sum(b)), by=a]

# join the per-letter sums to the samples, add them up within each
# sample id g, then divide total sum by total count to get the mean
dtagg[mDT, on=.(a), .(g, sumb, N)][, lapply(.SD, sum), by=g][, .(g, res = sumb/N)]

         g      res
   1:    1 50017.85
   2:    2 49980.03
   3:    3 50093.80
   4:    4 50087.67
   5:    5 49990.83
  ---              
 996:  996 50013.11
 997:  997 50095.43
 998:  998 49913.61
 999:  999 50058.44
1000: 1000 49909.36

allow.cartesian is not needed in this case, since each row of mDT matches only a single row in dtagg. On my computer, the speedup with the example data is pretty big, but most of the benefit comes from taking advantage of the form of the example function, I guess:

  • 13.7 sec OP's approach
  • 11.4 sec simple join
  • 0.02 sec aggregate-first
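For reference, a rough way to reproduce such a comparison with system.time (the numbers will of course vary by machine):

# OP's loop
system.time(
  for (i in 1:1000) mm[[i]] <- foo(dt[a %in% sample(letters, 5)])
)

# simple join
system.time(
  dt[mDT, on=.(a), allow.cartesian=TRUE, .(g, b)][, .(res = mean(b)), by=g]
)

# aggregate-first (dtagg as computed above)
system.time(
  dtagg[mDT, on=.(a), .(g, sumb, N)][, lapply(.SD, sum), by=g][, .(g, res = sumb/N)]
)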
answered Oct 21 '22 by Frank