
Is there any sparse support for the dist() function in R?

Has anyone heard of a package or function that works like dist() from the stats package in R, which computes a distance matrix between the rows of a data matrix using a specified distance measure, but takes a sparse matrix as input?

My data.frame (named dataCluster) has dimensions 7000 x 10000 and is almost 99% sparse. In regular dense form, this call never seems to finish:

h1 <- hclust(dist(dataCluster), method = "complete")

Similar question without an answer: Sparse Matrix as input to Hierarchical clustering in R

asked Apr 02 '15 by Marcin Kosiński


2 Answers

**Update:** You can replicate what qlcMatrix::cosSparse does quite easily, in fact:

library(Matrix)

sparse.cos <- function(x, y = NULL, drop = TRUE) {
    if (!is.null(y)) {
        if (!inherits(x, "dgCMatrix") || !inherits(y, "dgCMatrix"))
            stop("x and y must be of class dgCMatrix")
        if (drop) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
        # rescale each column by the inverse of its L2 norm, then cross-multiply
        crossprod(
            tcrossprod(x, Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)),
            tcrossprod(y, Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, nrow(y)))) ^ -0.5))
        )
    } else {
        if (!inherits(x, "dgCMatrix")) stop("x must be of class dgCMatrix")
        if (drop) colnames(x) <- rownames(x) <- NULL
        crossprod(
            tcrossprod(x, Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5))
        )
    }
}

I can find no significant difference in performance between the above and qlcMatrix::cosSparse.
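As a quick sanity check, the two should agree up to floating-point noise (a minimal sketch; the test matrix here is a made-up random one):

library(Matrix)
library(qlcMatrix)

set.seed(1)
m <- rsparsematrix(1000, 200, density = 0.01)  # hypothetical test matrix

# both compute column-by-column cosine similarities
all.equal(as.matrix(sparse.cos(m)), as.matrix(cosSparse(m)), check.attributes = FALSE)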


qlcMatrix::cosSparse is faster than wordspace::dist.matrix when the data is more than 50% sparse or when the similarity is computed along the longer dimension of the input matrix (i.e. tall format).

Performance of wordspace::dist.matrix vs. qlcMatrix::cosSparse on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse) to calculate a 1000 x 1000 similarity:

# M10 is 10% sparse (density 0.9), M99 is 99% sparse (density 0.01)
library(Matrix)      # rsparsematrix
library(qlcMatrix)   # cosSparse
library(wordspace)   # dist.matrix
library(rbenchmark)  # benchmark

set.seed(123)
M10 <- rsparsematrix(5000, 1000, density = 0.9)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
 "cosSparse: 10% sparse" = cosSparse(M10),
 "cosSparse: 50% sparse" = cosSparse(M50),
 "cosSparse: 90% sparse" = cosSparse(M90),
 "cosSparse: 99% sparse" = cosSparse(M99),
 "wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
 "wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
 replications = 2, columns = c("test", "elapsed", "relative"))

The two functions are quite comparable, with wordspace taking a slight lead at lower sparsity, but definitely not at high sparsity:

                   test elapsed relative
1 cosSparse: 10% sparse   15.83  527.667
2 cosSparse: 50% sparse    4.72  157.333
3 cosSparse: 90% sparse    0.31   10.333
4 cosSparse: 99% sparse    0.03    1.000
5 wordspace: 10% sparse   15.23  507.667
6 wordspace: 50% sparse    4.28  142.667
7 wordspace: 90% sparse    0.36   12.000
8 wordspace: 99% sparse    0.09    3.000

If we flip the calculation around to compute a 5000 x 5000 matrix, then:

benchmark(
 "cosSparse: 50% sparse" = cosSparse(tM50),
 "cosSparse: 90% sparse" = cosSparse(tM90),
 "cosSparse: 99% sparse" = cosSparse(tM99),
 "wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
 replications = 1, columns = c("test", "elapsed", "relative"))

Now the competitive advantage of cosSparse becomes very clear:

                   test elapsed relative
1 cosSparse: 50% sparse   10.58  151.143
2 cosSparse: 90% sparse    1.44   20.571
3 cosSparse: 99% sparse    0.07    1.000
4 wordspace: 50% sparse   11.41  163.000
5 wordspace: 90% sparse    2.39   34.143
6 wordspace: 99% sparse    0.64    9.143

The change in efficiency is not very dramatic at 50% sparsity, but at 90% sparsity, wordspace is 1.6x slower, and at 99% sparsity it's nearly 10x slower!

Compare this performance to a square matrix:

M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)

benchmark(
 "cosSparse: square, 50% sparse" = cosSparse(M50.square),
 "wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
 "cosSparse: square, 90% sparse" = cosSparse(M90.square),
 "wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
 replications = 5, columns = c("test", "elapsed", "relative"))

cosSparse is marginally faster at 50% sparsity, and almost twice as fast at 90% sparsity!

                           test elapsed relative
1 cosSparse: square, 50% sparse    2.12    9.217
3 cosSparse: square, 90% sparse    0.23    1.000
2 wordspace: square, 50% sparse    2.15    9.348
4 wordspace: square, 90% sparse    0.40    1.739

Note that wordspace::dist.matrix has more edge-case checks than qlcMatrix::cosSparse and also permits parallelization through OpenMP in R (see the sketch below). In addition, wordspace::dist.matrix supports Euclidean and Jaccard distance measures, although these are far slower. There are a number of other handy features built into that package.
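For instance, enabling threads looks like this (a sketch; wordspace.openmp() is wordspace's thread control, the matrix here is made up, and on builds without OpenMP support the call has no effect):

library(Matrix)
library(wordspace)

wordspace.openmp(threads = 4)  # request 4 OpenMP threads where available
M <- rsparsematrix(5000, 1000, density = 0.1)
d <- dist.matrix(t(M), method = "euclidean", byrow = TRUE)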

That said, if you only need cosine similarity, and your matrix is >50% sparse, and you're computing the tall way, cosSparse should be the tool of choice.
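To tie this back to the question: one common way to get from cosine similarity to hclust() is the distance 1 - similarity (a sketch under that assumption; the data below is a random stand-in for the question's 7000 x 10000 matrix):

library(Matrix)
library(qlcMatrix)

set.seed(1)
dataCluster <- rsparsematrix(7000, 10000, density = 0.01)  # stand-in data

sim <- cosSparse(t(dataCluster))  # cosSparse compares columns, so transpose to compare rows
d <- as.dist(1 - as.matrix(sim))  # caution: the 7000 x 7000 similarity densifies here
h1 <- hclust(d, method = "complete")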

answered Nov 16 '22 by zdebruine


You want wordspace::dist.matrix.

It accepts sparse matrices from the Matrix package (which isn't clear from the documentation), and it can also compute cross-distances, return either a Matrix or a dist object, and more.

The default distance measure is 'cosine', though, so be sure to specify method = 'euclidean' if you want Euclidean distance.
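Applied to the question's setup, that might look like the following (a sketch, assuming the data is held as a sparse dgCMatrix rather than a data.frame; as.dist = TRUE asks dist.matrix for a dist object that hclust() accepts):

library(Matrix)
library(wordspace)

set.seed(42)
dataCluster <- rsparsematrix(7000, 10000, density = 0.01)  # stand-in data

d <- dist.matrix(dataCluster, method = "euclidean", byrow = TRUE, as.dist = TRUE)
h1 <- hclust(d, method = "complete")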

answered Nov 16 '22 by flying sheep