Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Computing sparse pairwise distance matrix in R

Tags:

I have a NxM matrix and I want to compute the NxN matrix of Euclidean distances between the M points. In my problem, N is about 100,000. As I plan to use this matrix for a k-nearest neighbor algorithm, I only need to keep the k smallest distances, so the resulting NxN matrix is very sparse. This is in contrast to what comes out of dist(), for example, which would result in a dense matrix (and probably storage problems for my size N).

The packages for kNN that I've found so far (knnflex, kknn, etc) all appear to use dense matrices. Also, the Matrix package does not offer a pairwise distance function.

Closer to my goal, I see that the spam package has a nearest.dist() function that allows one to only consider distances less than some threshold, delta. In my case, however, a particular value of delta may produce too many distances (so that I have to store the NxN matrix densely) or too few distances (so that I can't use kNN).

I have seen previous discussion on trying to perform k-means clustering using the bigmemory/biganalytics packages, but it doesn't seem like I can leverage these methods in this case.

Does anybody know a function/implementation that will compute a distance matrix in a sparse fashion in R? My (dreaded) backup plan is to have two for loops and save results in a Matrix object.

like image 540
Christopher DuBois Avatar asked Apr 06 '11 00:04

Christopher DuBois


2 Answers

If you want to keep the logic of your min.k.dist function and return duplicate distances, you might want to consider modifying it a bit. It seems pointless to return the first line with 0 distance, right? ...and by incorporating some of the tricks in my other answer, you can speed up your version by some 30%:

min.k.dists2 <- function(x, k=4L) {
  k <- max(2L, k + 1L)
  apply(x, 2, function(r) {
    sort.list(colSums((x - r)^2), na.last=NA, method='quick')[2:k]
  })
}

> n<-1e4; m<-3; m=matrix(runif(n*m), n)
> system.time(d <- min.k.dists(t(m), 4)) #To get 3 nearest neighbours and itself
   user  system elapsed 
  17.26    0.00   17.30 
> system.time(d <- min.k.dists2(t(m), 3)) #To get 3 nearest neighbours
   user  system elapsed 
   12.7     0.0    12.7 
like image 90
Tommy Avatar answered Oct 25 '22 14:10

Tommy


For now I am using the following, inspired by this answer. The output is a n x k matrix where element (i,k) is the index of the data point that is the kth closest to i.

n <- 10
d <- 3
x <- matrix(rnorm(n * d), ncol = n)

min.k.dists <- function(x,k=5) {
  apply(x,2,function(r) {
    b <- colSums((x - r)^2)
    o <- order(b)
    o[1:k]
  })
}

min.k.dists(x)  # first row should be 1:ncol(x); these points have distance 0
dist(t(x))      # can check answer against this

If one is worried about how ties are handled and whatnot, perhaps rank() should be incorporated.

The above code seems somewhat fast, but I'm sure it could be improved (though I don't have time to go the C or fortran route). So I'm still open to fast and sparse implementations of the above.

Below I include a parallelized version that I ended up using:

min.k.dists <- function(x,k=5,cores=1) {
  require(multicore)
  xx <- as.list(as.data.frame(x))
  names(xx) <- c()
  m <- mclapply(xx,function(r) {
    b <- colSums((x - r)^2)
    o <- order(b)
    o[1:k]
  },mc.cores=cores)
  t(do.call(rbind,m))
}
like image 36
Christopher DuBois Avatar answered Oct 25 '22 15:10

Christopher DuBois