
Parallel distance Matrix in R

Currently I'm using the built-in function dist to calculate my distance matrix in R.

dist(featureVector,method="manhattan")

This is currently the bottleneck of the application, so the idea was to parallelize this task (conceptually this should be possible).

Searching Google and this forum turned up nothing.

Does anybody have an idea?

Vespasian asked Jun 16 '13 22:06


2 Answers

The R package amap provides robust and parallelized functions for clustering and principal component analysis. Among these, the Dist function offers what you are looking for: it computes and returns the distance matrix in parallel.

Dist(x, method = "euclidean", nbproc = 8)

The code above computes the Euclidean distance with 8 threads.
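To match the call in the question, the same function can be asked for the Manhattan metric. The following is a minimal sketch, assuming the amap package is installed; the random featureVector and the choice of 2 threads are placeholders for illustration, not values from the question. It compares the result against dist() rather than assuming the two agree.

```r
# Synthetic stand-in for the asker's data (assumption: 200 rows, 10 features).
set.seed(1)
featureVector <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# Serial reference result from base R.
d.serial <- dist(featureVector, method = "manhattan")

# Parallel version; runs only if amap is available on this machine.
if (requireNamespace("amap", quietly = TRUE)) {
    d.parallel <- amap::Dist(featureVector, method = "manhattan", nbproc = 2)
    # Compare the two results numerically, ignoring attributes.
    print(all.equal(as.vector(d.serial), as.vector(d.parallel)))
}
```

Whether this beats dist() depends on the size of the data and the number of cores, so it is worth timing both with system.time() on the real input.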

Zhilong Jia answered Oct 13 '22 08:10


Here's the structure for one route you could take. It is not faster than just using the dist() function; in fact, it takes many times longer. It does process in parallel, but even if the computation time were reduced to zero, the time to set up the function and export the variables to the cluster would probably be longer than just using dist().

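library(parallel)

# Example data: 2000 observations with 100 features each
vec.array <- matrix(rnorm(2000 * 100), nrow = 2000, ncol = 100)

# Manhattan (taxicab) distance from one row vector to every row of the matrix
TaxiDistFun <- function(one.vec, whole.matrix) {
    diff.matrix <- t(t(whole.matrix) - one.vec)
    this.row <- apply(diff.matrix, 1, function(x) sum(abs(x)))
    return(this.row)
}

# Start one worker per core and copy the data and the function to each worker
cl <- makeCluster(detectCores())
clusterExport(cl, c("vec.array", "TaxiDistFun"))

# Apply the distance function to each row in parallel and time it
system.time(dist.array <- parRapply(cl, vec.array,
                        function(x) TaxiDistFun(x, vec.array)))

stopCluster(cl)

# parRapply returns a flat vector; reshape it into the full distance matrix
dim(dist.array) <- c(2000, 2000)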
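
One variation on the structure above is to hand each worker a block of rows instead of a single row, so the cluster is called once per worker rather than once per row. This is only a sketch under assumptions: BlockTaxiDist, n.workers, and the smaller 200 x 20 matrix are hypothetical names and sizes chosen for illustration, and nothing here shows that it outperforms dist(). The last line checks the result against dist().

```r
library(parallel)

set.seed(42)
vec.array <- matrix(rnorm(200 * 20), nrow = 200, ncol = 20)

# Manhattan distances from a block of rows to every row of the matrix.
# Returns a length(row.idx) x nrow(whole.matrix) matrix.
BlockTaxiDist <- function(row.idx, whole.matrix) {
    tm <- t(whole.matrix)  # columns of tm are the observations
    block <- vapply(row.idx,
                    function(i) colSums(abs(tm - tm[, i])),
                    numeric(ncol(tm)))
    t(block)
}

n.workers <- 2  # assumption: adjust to the machine
cl <- makeCluster(n.workers)

# Split the row indices into one contiguous block per worker.
blocks <- splitIndices(nrow(vec.array), n.workers)

# The matrix is passed as an extra argument, so no clusterExport is needed.
block.results <- parLapply(cl, blocks, BlockTaxiDist, whole.matrix = vec.array)

stopCluster(cl)

# Stack the blocks back into the full square distance matrix.
dist.array <- do.call(rbind, block.results)

# Compare with the serial result, ignoring dimnames.
print(all.equal(dist.array,
                as.matrix(dist(vec.array, method = "manhattan")),
                check.attributes = FALSE))
```

This still computes both triangles of the symmetric matrix, so it does roughly twice the arithmetic that dist() needs.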
Will Beason answered Oct 13 '22 08:10