
Fastest way to apply function to all pairwise combinations of columns

Tags: r, data.table

Given a data frame or matrix with an arbitrary number of rows and columns, what is the fastest way to apply a function to all pairwise combinations of columns?

For example, if I have a data table:

library(data.table)

N <- 3
K <- 3
data <- data.table(id=seq(N))
for(k in seq(K)) {
    # note: k = 1 overwrites the placeholder id column, leaving K random columns
    data[[k]] <- runif(N)
}

And I want to compute the simple difference between all pairs of columns, I could loop (or lapply) over columns:

differences = data.table(foo=seq(N))
for(var1 in names(data)) {
    for(var2 in names(data)) {
        if (var1==var2) next
        if (which(names(data)==var1)>which(names(data)==var2)) next
        combo <- paste0(var1, var2)
        differences[[combo]] <- data[[var1]]-data[[var2]]
    }
}

But as K gets larger, this becomes absurdly slow.

One solution I've considered is to make two new data tables using combn and subtract them:

pairs <- combn(colnames(data), 2)
a <- data[, pairs[1, ], with = FALSE]
b <- data[, pairs[2, ], with = FALSE]
differences <- a - b

But as N and K get larger, this becomes very memory intensive (though faster than looping).
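
To put a rough number on that memory cost: the combn approach holds three tables at once (a, b, and the result), each with choose(K, 2) columns of N doubles. A back-of-the-envelope sketch, assuming 8 bytes per double and ignoring any data.table overhead (the helper name pairwise_bytes is made up for illustration):

```r
# Hypothetical helper: approximate bytes held by a, b and differences together.
# Assumes every column is a double vector (8 bytes per value).
pairwise_bytes <- function(N, K, tables = 3, bytes_per_double = 8) {
  tables * choose(K, 2) * N * bytes_per_double
}

# Approximate footprint in MiB for a 300 x 500 input
pairwise_bytes(N = 300, K = 500) / 1024^2
```

The number of pairs grows quadratically in K, so doubling the column count roughly quadruples this estimate.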

It seems to me that the outer product of the matrix with itself is probably the best way to go, but I can't piece it together. This is especially hard if I want to apply an arbitrary function (RMSE for example), instead of just the difference.

What's the fastest way?

dmp asked Jan 13 '16 02:01



1 Answer

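If the data is in a matrix to begin with, you can reshape it into a long data.table (one row per cell, keyed by column index) and then, for each column, subtract it from every later column: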

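library(data.table)

data <- matrix(runif(300*500), nrow = 300, ncol = 500)

# long format: V1 holds the cell values, colId/rowId record their position
data.DT <- setkey(data.table(c(data), colId = rep(1:500, each = 300), rowId = rep(1:300, times = 500)), colId)

# one group per column (col1); vv is recycled across all columns with a larger index
diff.DT <- data.DT[
  , {
    ccl <- unique(colId)
    vv <- V1
    data.DT[colId > ccl, .(col2 = colId, V1 - vv)]
  }
  , keyby = .(col1 = colId)
]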
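
The question also asks about arbitrary functions such as RMSE. The code above is specific to differences, so here is a minimal base R sketch of one way that case might be handled. It is not the data.table approach shown above and makes no claim to be the fastest option. The names pairwise_apply and rmse are illustrative, not part of any package, and the function is assumed to return a single number per pair:

```r
# Sketch: apply a two-column function f to every pair of columns of matrix m.
# Assumes f(x, y) returns one numeric value.
pairwise_apply <- function(m, f) {
  idx <- combn(ncol(m), 2)   # 2 x choose(K, 2) matrix of column index pairs
  out <- vapply(
    seq_len(ncol(idx)),
    function(p) f(m[, idx[1, p]], m[, idx[2, p]]),
    numeric(1)
  )
  data.frame(col1 = idx[1, ], col2 = idx[2, ], value = out)
}

# Example function: root mean squared error between two columns
rmse <- function(x, y) sqrt(mean((x - y)^2))

set.seed(1)
m <- matrix(runif(300 * 50), nrow = 300, ncol = 50)
res <- pairwise_apply(m, rmse)
head(res)
```

This returns one row per pair rather than one column per pair, so the result stays at choose(K, 2) rows regardless of N. For plain differences, indexing the matrix directly with m[, idx[1, ]] - m[, idx[2, ]] avoids the per-pair function call, at the memory cost discussed in the question.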
Alexander Radev answered Sep 27 '22 21:09
