
Calculate function for all row combinations of two matrices in R

Tags:

r

I would like to calculate a distance measure for all combinations of rows between two matrices/data frames.

The result would be a matrix with cell i,j corresponding to the result given by the function applied to row i of the first matrix and row j of the second matrix. Here is an example illustrating what I want to do with for loops, with an example function.

x<-matrix(rnorm(30),10,3)  ## Example data
y<-matrix(rnorm(12),4,3)

results<-matrix(NA,nrow(x),nrow(y))

for (i in 1:nrow(x)){
  for (j in 1:nrow(y)){
    r1<-x[i,]
    r2<-y[j,]
    results[i,j]<-sum(r1*r2)  ## Example function
  }
}

In practice, the first matrix has hundreds of thousands of rows and the second has a few hundred, and the function I want to calculate is not the dot product (I realize my example makes it look as if all I want is matrix multiplication). In fact, there are several functions I would like to substitute in, so I am looking for a solution that generalizes to different functions. One way to think about it is that I would like to hijack matrix multiplication to apply other functions. With for loops the calculation takes so long that it is not practical. I would be grateful for any tips on how to do this without for loops.

klar asked May 25 '12 18:05



2 Answers

outer(1:nrow(x), 1:nrow(y), Vectorize(function(i, j) sum(x[i, ] * y[j, ])))
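
To swap in a different function, the same pattern can be wrapped in a small helper. This is only a sketch and not part of the original answer: the helper name `row_outer` and the Manhattan distance example are hypothetical, chosen for illustration. Note that `Vectorize` still calls the function once per pair of rows, so this is tidier than the explicit loops but not necessarily much faster.

```r
# Hypothetical helper: apply f(row i of x, row j of y) to every pair of rows.
row_outer <- function(x, y, f) {
  outer(seq_len(nrow(x)), seq_len(nrow(y)),
        Vectorize(function(i, j) f(x[i, ], y[j, ])))
}

set.seed(1)
x <- matrix(rnorm(30), 10, 3)
y <- matrix(rnorm(12), 4, 3)

# Dot product, as in the question's example
dot <- row_outer(x, y, function(a, b) sum(a * b))

# Any other function of two rows that returns a single number
manhattan <- row_outer(x, y, function(a, b) sum(abs(a - b)))
```

The function passed as `f` must return a single value for each pair of rows, otherwise `outer` cannot reshape the result into a matrix.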
Julius Vainora answered Sep 30 '22 01:09



I know you asked this a long time ago, but I thought I would share a solution that becomes more efficient than the for loop as the number of rows grows. With a small number of rows the speed difference is negligible (and the for loop may even be faster). It relies only on subsetting and rowSums and is very simple:

## For reproducibility
set.seed( 35471 )

## Example data - bigger than the original to get an idea of the difference in speed
x<-matrix(rnorm(60),20,3)
y<-matrix(rnorm(300),100,3)

# My function, which uses expand.grid to get all combinations of row indices, then rowSums to operate on them
rs <- function( x , y ){
  ## One row per (i, j) pair; the first index varies fastest, matching column-major matrix filling
  rows <- expand.grid( 1:nrow(x) , 1:nrow(y) )
  results <- matrix( rowSums( x[ rows[,1] , ] * y[ rows[,2] , ] ) , nrow(x) , nrow(y) )
  return(results)
}

# Your original function
flp <- function(x ,y){
  results<-matrix(NA,nrow(x),nrow(y))
  for (i in 1:nrow(x)){
    for (j in 1:nrow(y)){
      r1<-x[i,]
      r2<-y[j,]
      results[i,j]<-sum(r1*r2)  ## Example function
    }
  }
  return(results)
}


## Benchmark timings:
library(microbenchmark)
microbenchmark( rs( x, y ) , flp( x ,y ) , times = 100L )
#Unit: microseconds
#     expr      min       lq     median        uq      max neval
#  rs(x, y)  487.500  527.396   558.5425   620.486   679.98   100
# flp(x, y) 9253.385 9656.193 10008.0820 10430.663 11511.70   100

## And a subset of the results returned from each function to confirm they return the same thing!
flp(x,y)[1:3,1:3]
#          [,1]       [,2]       [,3]
#[1,] -0.5528311  0.1095852  0.4461507
#[2,] -1.9495687  1.7814502 -0.3769874
#[3,]  1.8753978 -3.0908057  2.2341414

rs(x,y)[1:3,1:3]
#          [,1]       [,2]       [,3]
#[1,] -0.5528311  0.1095852  0.4461507
#[2,] -1.9495687  1.7814502 -0.3769874
#[3,]  1.8753978 -3.0908057  2.2341414

So you can see that by using rowSums and subsetting we can be roughly 18 times faster than the for loop (comparing median timings) when the number of row combinations is just 2000. With more rows, the difference in speed should be even greater.
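
The same indexing trick carries over to other functions, as long as they can be written as vectorized arithmetic on whole rows. As a sketch (the helper name `pairwise_rows` and the Euclidean distance example are illustrative assumptions, not from the original post), the row-wise function can be passed in as an argument:

```r
# Hypothetical helper: f receives two equally sized matrices whose rows are
# the paired rows of x and y, and must return one value per row.
pairwise_rows <- function(x, y, f) {
  idx <- expand.grid(i = seq_len(nrow(x)), j = seq_len(nrow(y)))
  a <- x[idx$i, , drop = FALSE]  # row i of x, repeated for every pair
  b <- y[idx$j, , drop = FALSE]  # row j of y, repeated for every pair
  matrix(f(a, b), nrow(x), nrow(y))
}

set.seed(35471)
x <- matrix(rnorm(60), 20, 3)
y <- matrix(rnorm(300), 100, 3)

# Dot product, equivalent to rs() above
dot <- pairwise_rows(x, y, function(a, b) rowSums(a * b))

# Euclidean distance between every row of x and every row of y
eucl <- pairwise_rows(x, y, function(a, b) sqrt(rowSums((a - b)^2)))
```

One caveat: the expanded matrices have nrow(x) * nrow(y) rows, so with hundreds of thousands of rows in the first matrix they may not fit comfortably in memory. Processing the first matrix in blocks of rows and binding the results together is one possible workaround.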

HTH.

Simon O'Hanlon answered Sep 30 '22 02:09
