What is the fastest calculation for bipartite distance in R with a parallelized Rcpp backend?
parallelDist
is a great package with a cpp backend and support for multi-threading, but does not support bipartite distance calculations (to my knowledge).
Using parallelDist()
for bipartite distance matrix computation. This involves calculating m1:m1 and m2:m2 in addition to m1:m2 -- highly inefficient.
library(parallelDist)
bipartiteDist <- function(matrix1,matrix2){
matrix12 <- rbind(matrix1,matrix2)
d <- parallelDist(matrix12)
d <- as.matrix(d)[(1:nrow(matrix1)),((nrow(matrix1)+1):(nrow(matrix1)*2))]
d
}
matrix1 <- abs(matrix(rnorm(1000),10,100000))
matrix2 <- abs(matrix(rnorm(1000),10,100000))
dist <- bipartiteDist(matrix1, matrix2)
This approach is faster than pDist or a pure R implementation when more than 3 cores are available.
pdist
is perfect for computing bipartite distances, but does not support multithreading.
Any fast implementations for parallelized bipartite distance computation?
The wordspace dist.matrix()
function supports parallelized calculation of bipartite distances.
Benchmarking wordspace
against parallelDist
matrix1 <- abs(matrix(rnorm(1000),100,100000))
matrix2 <- abs(matrix(rnorm(1000),100,100000))
library(rbenchmark)
library(parallelDist)
library(wordspace)
bipartiteDist_parallelDist <- function(matrix1,matrix2){
matrix12 <- rbind(matrix1,matrix2)
d <- parallelDist(matrix12, method = "euclidean")
d <- as.matrix(d)[(1:nrow(matrix1)),((nrow(matrix1)+1):(nrow(matrix1)*2))]
d
}
bipartiteDist_wordspace <- function(matrix1,matrix2){
wordspace.openmp(threads = wordspace.openmp()$max)
dist.matrix(matrix1,matrix2, byrow = TRUE, method = "euclidean", convert = FALSE)
}
benchmark("parallelDist" = {
bd1 <- bipartiteDist_parallelDist(matrix1,matrix2)
},
"wordspace" = {
bd2 <- bipartiteDist_wordspace(matrix1,matrix2)
},
replications = 1,
columns = c("test", "replications", "elapsed",
"relative", "user.self", "sys.self"))
plot(bd1,bd2) # yes, both methods give near-identical results
Benchmarking results:
test replications elapsed relative user.self sys.self
1 parallelDist 1 2.120 12.184 126.145 0.523
2 wordspace 1 0.174 1.000 3.749 0.252
I used 80 threads.
A framework for further speed gains
The author of wordspace
admits to emphasizing low memory load over speed, thus additional speed gains are possible (source).
For instance, here is a general framework for Euclidean distance:
bipartiteDist3 <- function(matrix1,matrix2){
m1tm2 <- tcrossprod(matrix1,matrix2)
sq1 <- rowSums(matrix1^2)
sq2 <- rowSums(matrix2^2)
out0 <- outer(sq1, sq2, "+") - 2 * m1tm2
sqrt(out0)
}
I am very interested in a parallelized solution optimized for sparse matrices. To my knowledge, wordspace
does not optimize for sparsity. For example, there are parallelizable sparse matrix implementations of tcrossprod, rowSums, and outer function equivalents.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With