
Applying a function to a distance matrix in R

Tags: algorithm, r

This question came up today on the manipulatr mailing list.

http://groups.google.com/group/manipulatr/browse_thread/thread/fbab76945f7cba3f

Rephrased:

Given a distance matrix (calculated with dist), apply a function to each row of that matrix.

Code:

library(plyr)
N <- 100
a <- data.frame(b = 1:N, c = runif(N))
d <- dist(a, diag = TRUE, upper = TRUE)
# as.matrix() expands d into the full N x N matrix before the row-wise sum
sumd <- adply(as.matrix(d), 1, sum)

The problem is that to apply the function by row you have to store the whole matrix (instead of just the lower triangular part), so it uses too much memory for large matrices. It fails on my computer for matrices of dimension ~10000.

Any ideas?

asked Nov 07 '09 07:11 by Eduardo Leoni



1 Answer

First of all, for anyone who hasn't seen this yet, I strongly recommend reading this article on the r-wiki about code optimization.

Here's another version that avoids ifelse (a relatively slow function):

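# Position of element (i, j) in the vector stored by dist() for N points.
# Returns 0 on the diagonal. Works with a vector i and a scalar j.
noeq.2 <- function(i, j, N) {
    i <- i-1                                          # switch to 0-based indices
    j <- j-1
    x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i        # position when i < j
    x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j       # position when i > j
    idx <- i < j
    x[!idx] <- x2[!idx]
    x[i==j] <- 0                                      # the diagonal is not stored
    x
}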
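As a quick sanity check, the sketch below compares the positions returned by noeq.2 against the full matrix on a small example. The data, the seed, and the names pts, m, pos and col3 are made up for illustration; the full matrix is built here only so there is something to compare with.

```r
# noeq.2 copied from the answer so that this block runs on its own
noeq.2 <- function(i, j, N) {
    i <- i-1
    j <- j-1
    x <- i*(N-1) - (i-1)*((i-1) + 1)/2 + j - i
    x2 <- j*(N-1) - (j-1)*((j-1) + 1)/2 + i - j
    idx <- i < j
    x[!idx] <- x2[!idx]
    x[i==j] <- 0
    x
}

set.seed(42)                      # arbitrary seed, only for reproducibility
N <- 6
pts <- matrix(runif(N * 2), N)    # illustrative data: 6 random points in 2-D
d <- dist(pts)                    # compact form: N * (N - 1) / 2 distances
m <- as.matrix(d)                 # full N x N matrix, used only for comparison

# Single element: position of (2, 4) inside d
k <- noeq.2(2, 4, N)
ok_single <- isTRUE(all.equal(d[k], m[2, 4]))

# Whole column 3: the diagonal position is 0, so that entry stays at 0
pos <- noeq.2(1:N, 3, N)
col3 <- numeric(N)
col3[pos > 0] <- d[pos[pos > 0]]
ok_column <- isTRUE(all.equal(col3, unname(m[, 3])))

print(c(single = ok_single, column = ok_column))
```

Because the diagonal maps to position 0, any code that indexes d with these positions has to handle that entry separately, as done for col3 here.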

And timings on my laptop:

> N <- 1000
> system.time(sapply(1:N, function(i) sapply(1:N, function(j) noeq(i, j, N))))
   user  system elapsed 
  51.31    0.10   52.06 
> system.time(sapply(1:N, function(j) noeq.1(1:N, j, N)))
   user  system elapsed 
   2.47    0.02    2.67 
> system.time(sapply(1:N, function(j) noeq.2(1:N, j, N)))
   user  system elapsed 
   0.88    0.01    1.12 

And lapply is faster than sapply:

> system.time(do.call("rbind",lapply(1:N, function(j) noeq.2(1:N, j, N))))
   user  system elapsed 
   0.67    0.00    0.67 
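
To connect this back to the original question, here is a minimal sketch of applying a function row by row directly on the dist object, so that only one row is extracted at a time. The helper name dist_row_apply is hypothetical (it is not part of plyr or base R), it uses the same index arithmetic as above written in 1-based form, and it assumes f returns a single number per row.

```r
# Hypothetical helper: apply f to every row of the distance matrix
# without calling as.matrix() on the dist object.
dist_row_apply <- function(d, f) {
    n <- attr(d, "Size")                      # number of points behind d
    vapply(seq_len(n), function(r) {
        j <- seq_len(r - 1)                   # columns left of the diagonal
        left <- n * (j - 1) - j * (j - 1) / 2 + r - j
        below <- if (r < n) {
            # entries for pairs (i, r) with i > r are stored contiguously
            n * (r - 1) - r * (r - 1) / 2 + seq_len(n - r)
        } else {
            integer(0)
        }
        f(c(d[left], 0, d[below]))            # row r, with 0 on the diagonal
    }, numeric(1))
}

set.seed(1)                                   # arbitrary seed
N <- 100
a <- data.frame(b = 1:N, c = runif(N))        # same shape of data as in the question
d <- dist(a)
sumd <- dist_row_apply(d, sum)

# Compare with the full-matrix approach on this small case
same <- isTRUE(all.equal(sumd, unname(rowSums(as.matrix(d)))))
print(same)
```

This trades memory for index arithmetic on every row, so it is likely slower than rowSums on a matrix that does fit in memory.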
answered Sep 22 '22 01:09 by Shane