Have anyone heard about any package or functionality that works the same as the dist{stats}
function from R which creates the
distance matrix that is computed by using the specified distance measure to compute the distances between the rows of a data matrix,
but take a sprase matrix as an input?
My data.frame (named dataCluster
) has dims: 7000 X 10000 and is almost 99% sparse. In regular form that is not sparse this function doesn't seem to stop working...
h1 <- hclust( dist( dataCluster ) , method = "complete" )
Similar question without an answer: Sparse Matrix as input to Hierarchical clustering in R
Definition: The dist R function computes distance matrices. Basic R Syntax: You can find the basic R programming syntax of the dist function below. dist ( x ) # Basic R syntax of dist function
There is of course the question of how to represent the sparse matrix. A simple way is to have it only contain the indices of the points that are closest (and recalculate as needed). But in the solution below, I put both distance ('d1' etc) and index ('i1' etc) in a single matrix:
get_dist (): Computes a distance matrix between the rows of a data matrix. Compared to the standard dist () function, it supports correlation-based distance measures including "pearson", "kendall" and "spearman" methods.
Compared to the standard dist () function, it supports correlation-based distance measures including "pearson", "kendall" and "spearman" methods. get_dist (x, method = "euclidean", stand = FALSE, ...) fviz_dist ( dist.obj, order = TRUE, show_labels = TRUE, lab_size = NULL, gradient = list (low = "red", mid = "white", high = "blue") )
**Update: ** You can do what qlcMatrix does quite easily in fact:
sparse.cos <- function(x, y = NULL, drop = TRUE){
if(!is.null(y)){
if(class(x) != "dgCMatrix" || class(y) != "dgCMatrix") stop ("class(x) or class(y) != dgCMatrix")
if(drop == TRUE) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
crossprod(
tcrossprod(
x,
Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, x@Dim[1]))) ^ -0.5)
),
tcrossprod(
y,
Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, x@Dim[1]))) ^ -0.5))
)
)
} else {
if(class(x) != "dgCMatrix") stop ("class(x) != dgCMatrix")
if(drop == TRUE) colnames(x) <- rownames(X) <- NULL
crossprod(
tcrossprod(
x,
Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5))
)
}
}
I can find no significant difference in performance between the above and qlcMatrix::cosSparse
.
qlcMatrix::cosSparse
is faster than wordspace::dist.matrix
when data is >50% sparse or the similarity is being calculated on the longest edge of the input matrix (i.e. tall format).
Performance of wordspace::dist.matrix
vs. qlcMatrix::cosSparse
on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse) to calculate a 1000 x 1000 similarity:
# M1 is 10% sparse, M99 is 99% sparse
set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 1)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
"cosSparse: 10% sparse" = cosSparse(M10),
"cosSparse: 50% sparse" = cosSparse(M50),
"cosSparse: 90% sparse" = cosSparse(M90),
"cosSparse: 99% sparse" = cosSparse(M99),
"wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
"wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
replications = 2, columns = c("test", "elapsed", "relative"))
The two functions are quite comparable, with wordspace taking a slight lead at lower sparsity, but definitely not at high sparsity:
test elapsed relative
1 cosSparse: 10% sparse 15.83 527.667
2 cosSparse: 50% sparse 4.72 157.333
3 cosSparse: 90% sparse 0.31 10.333
4 cosSparse: 99% sparse 0.03 1.000
5 wordspace: 10% sparse 15.23 507.667
6 wordspace: 50% sparse 4.28 142.667
7 wordspace: 90% sparse 0.36 12.000
8 wordspace: 99% sparse 0.09 3.000
If we flip the calculation around to compute a 5000 x 5000 matrix, then:
benchmark(
"cosSparse: 50% sparse" = cosSparse(tM50),
"cosSparse: 90% sparse" = cosSparse(tM90),
"cosSparse: 99% sparse" = cosSparse(tM99),
"wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
"wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
"wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
replications = 1, columns = c("test", "elapsed", "relative"))
Now the competitive advantage of cosSparse becomes very clear:
test elapsed relative
1 cosSparse: 50% sparse 10.58 151.143
2 cosSparse: 90% sparse 1.44 20.571
3 cosSparse: 99% sparse 0.07 1.000
4 wordspace: 50% sparse 11.41 163.000
5 wordspace: 90% sparse 2.39 34.143
6 wordspace: 99% sparse 0.64 9.143
The change in efficiency is not very dramatic at 50% sparsity, but at 90% sparsity, wordspace is 1.6x slower, and at 99% sparsity it's nearly 10x slower!
Compare this performance to a square matrix:
M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)
benchmark(
"cosSparse: square, 50% sparse" = cosSparse(M50.square),
"wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
"cosSparse: square, 90% sparse" = cosSparse(M90.square),
"wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
replications = 5, columns = c("test", "elapsed", "relative"))
cosSparse is marginally faster at both 50% sparsity, and almost twice as fast at 90% sparsity!
test elapsed relative
1 cosSparse: square, 50% sparse 2.12 9.217
3 cosSparse: square, 90% sparse 0.23 1.000
2 wordspace: square, 50% sparse 2.15 9.348
4 wordspace: square, 90% sparse 0.40 1.739
Note that the wordspace::dist.matrix
has more edge case checks than qlcMatrix::cosSparse
and also permits parallelization through openmp
in R. Also, wordspace::dist.matrix
supports euclidean and jaccard distance measures, although these are far slower. There are a number of other handy features built into that package.
That said, if you only need cosine similarity, and your matrix is >50% sparse, and you're computing the tall way, cosSparse
should be the tool of choice.
You want wordspace::dist.matrix
.
It accepts sparse matrices from the Matrix
package (which isn’t clear from the documentation) and can also do cross distances, output both Matrix
and dist
objects and more.
The default distance measure is 'cosine'
though, so be sure to specify method = 'euclidean'
if you want that.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With