
Is there any sparse support for dist function in R?

Tags:

r

sparse-matrix

hierarchical-clustering

Has anyone heard of a package or function that works like the dist{stats} function in R, which creates the

distance matrix that is computed by using the specified distance measure to compute the distances between the rows of a data matrix,

but takes a sparse matrix as input?

My data.frame (named dataCluster) has dimensions 7000 x 10000 and is almost 99% sparse. In its regular, non-sparse form, this call never seems to finish...

h1 <- hclust( dist( dataCluster ) , method = "complete" )

Similar question without an answer: Sparse Matrix as input to Hierarchical clustering in R


asked Apr 02 '15 16:04

Marcin Kosiński

2 Answers

**Update:** You can in fact do what qlcMatrix does quite easily:

library(Matrix)

# Cosine similarity between the columns of x, or between the columns of x and y.
# Each column is scaled to unit length with a diagonal matrix, then cross-multiplied.
sparse.cos <- function(x, y = NULL, drop = TRUE){
    if(!is.null(y)){
        if(!inherits(x, "dgCMatrix") || !inherits(y, "dgCMatrix")) stop ("class(x) or class(y) != dgCMatrix")
        if(drop == TRUE) colnames(x) <- rownames(x) <- colnames(y) <- rownames(y) <- NULL
        crossprod(
            tcrossprod(
                x,
                Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
            ),
            tcrossprod(
                y,
                Diagonal(x = as.vector(crossprod(y ^ 2, rep(1, nrow(y)))) ^ -0.5)
            )
        )
    } else {
        if(!inherits(x, "dgCMatrix")) stop ("class(x) != dgCMatrix")
        if(drop == TRUE) colnames(x) <- rownames(x) <- NULL
        crossprod(
            tcrossprod(
                x,
                Diagonal(x = as.vector(crossprod(x ^ 2, rep(1, nrow(x)))) ^ -0.5)
            )
        )
    }
}

I can find no significant difference in performance between the above and qlcMatrix::cosSparse.
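
As a sanity check on the formula itself, here is a minimal sketch that compares a sparse column-wise cosine similarity against a plain dense computation. The helper name `cos_cols` and the small test matrix are mine, not from qlcMatrix, and this assumes no column is entirely zero (a zero column would give NaN).

```r
library(Matrix)

# Hypothetical helper (not from qlcMatrix): cosine similarity between the
# columns of a sparse matrix, computed as D %*% t(X) %*% X %*% D, where D is a
# diagonal matrix holding 1 / (Euclidean norm of each column).
cos_cols <- function(x) {
  norms <- sqrt(colSums(x^2))
  scaled <- x %*% Diagonal(x = 1 / norms)  # each column now has unit length
  crossprod(scaled)                         # t(scaled) %*% scaled
}

set.seed(1)
X <- rsparsematrix(50, 8, density = 0.3)

# Dense reference: normalise the columns with base R, then cross-multiply.
Xd <- as.matrix(X)
ref <- crossprod(sweep(Xd, 2, sqrt(colSums(Xd^2)), "/"))

sim <- as.matrix(cos_cols(X))
max_abs_diff <- max(abs(sim - ref))  # should be at rounding-error level
```

The only sparse-specific part is the scaling step: multiplying by a `Diagonal` keeps the matrix sparse, whereas `sweep` would force a dense copy.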

qlcMatrix::cosSparse is faster than wordspace::dist.matrix when data is >50% sparse or the similarity is being calculated on the longest edge of the input matrix (i.e. tall format).

Performance of wordspace::dist.matrix vs. qlcMatrix::cosSparse on a wide matrix (1000 x 5000) of varying sparsity (10%, 50%, 90%, or 99% sparse) to calculate a 1000 x 1000 similarity:

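library(Matrix)      # rsparsematrix
library(qlcMatrix)   # cosSparse
library(wordspace)   # dist.matrix
library(rbenchmark)  # benchmark(); assumed, the post does not name the package

# Each matrix is named after its intended sparsity (M99 is 99% sparse).
# Note that M10 is generated with density = 1, so it is actually fully dense.
set.seed(123)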
M10 <- rsparsematrix(5000, 1000, density = 1)
M50 <- rsparsematrix(5000, 1000, density = 0.5)
M90 <- rsparsematrix(5000, 1000, density = 0.1)
M99 <- rsparsematrix(5000, 1000, density = 0.01)
tM10 <- t(M10)
tM50 <- t(M50)
tM90 <- t(M90)
tM99 <- t(M99)
benchmark(
 "cosSparse: 10% sparse" = cosSparse(M10),
 "cosSparse: 50% sparse" = cosSparse(M50),
 "cosSparse: 90% sparse" = cosSparse(M90),
 "cosSparse: 99% sparse" = cosSparse(M99),
 "wordspace: 10% sparse" = dist.matrix(tM10, byrow = TRUE),
 "wordspace: 50% sparse" = dist.matrix(tM50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(tM90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(tM99, byrow = TRUE),
 replications = 2, columns = c("test", "elapsed", "relative"))

The two functions are quite comparable, with wordspace taking a slight lead at lower sparsity, but definitely not at high sparsity:

                   test elapsed relative
1 cosSparse: 10% sparse   15.83  527.667
2 cosSparse: 50% sparse    4.72  157.333
3 cosSparse: 90% sparse    0.31   10.333
4 cosSparse: 99% sparse    0.03    1.000
5 wordspace: 10% sparse   15.23  507.667
6 wordspace: 50% sparse    4.28  142.667
7 wordspace: 90% sparse    0.36   12.000
8 wordspace: 99% sparse    0.09    3.000
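
If you want to confirm the sparsity labels used in a benchmark like this, the fraction of zero entries can be read straight off the generated matrices. A small sketch follows; the `sparsity` helper and the reduced matrix sizes are my own choices for illustration. Keep in mind that `density` in `rsparsematrix` is the fraction of non-zero entries, so `density = 1` gives a fully dense matrix.

```r
library(Matrix)

# Hypothetical helper: fraction of entries that are exactly zero
sparsity <- function(m) 1 - nnzero(m) / prod(dim(m))

set.seed(123)
# Smaller than the benchmark matrices so that this runs instantly
dense_like <- rsparsematrix(500, 100, density = 1)
sparse_90  <- rsparsematrix(500, 100, density = 0.1)
sparse_99  <- rsparsematrix(500, 100, density = 0.01)

s <- c(sparsity(dense_like), sparsity(sparse_90), sparsity(sparse_99))
print(round(s, 2))
```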

If we flip the calculation around to compute a 5000 x 5000 matrix, then:

benchmark(
 "cosSparse: 50% sparse" = cosSparse(tM50),
 "cosSparse: 90% sparse" = cosSparse(tM90),
 "cosSparse: 99% sparse" = cosSparse(tM99),
 "wordspace: 50% sparse" = dist.matrix(M50, byrow = TRUE),
 "wordspace: 90% sparse" = dist.matrix(M90, byrow = TRUE),
 "wordspace: 99% sparse" = dist.matrix(M99, byrow = TRUE),
 replications = 1, columns = c("test", "elapsed", "relative"))

Now the competitive advantage of cosSparse becomes very clear:

                   test elapsed relative
1 cosSparse: 50% sparse   10.58  151.143
2 cosSparse: 90% sparse    1.44   20.571
3 cosSparse: 99% sparse    0.07    1.000
4 wordspace: 50% sparse   11.41  163.000
5 wordspace: 90% sparse    2.39   34.143
6 wordspace: 99% sparse    0.64    9.143

The difference is small at 50% sparsity, but at 90% sparsity wordspace is about 1.7x slower, and at 99% sparsity it is roughly 9x slower!

Compare this performance to a square matrix:

M50.square <- rsparsematrix(1000, 1000, density = 0.5)
tM50.square <- t(M50.square)
M90.square <- rsparsematrix(1000, 1000, density = 0.1)
tM90.square <- t(M90.square)

benchmark(
 "cosSparse: square, 50% sparse" = cosSparse(M50.square),
 "wordspace: square, 50% sparse" = dist.matrix(tM50.square, byrow = TRUE),
 "cosSparse: square, 90% sparse" = cosSparse(M90.square),
 "wordspace: square, 90% sparse" = dist.matrix(tM90.square, byrow = TRUE),
 replications = 5, columns = c("test", "elapsed", "relative"))

cosSparse is marginally faster at 50% sparsity, and about 1.7x faster at 90% sparsity:

                           test elapsed relative
1 cosSparse: square, 50% sparse    2.12    9.217
3 cosSparse: square, 90% sparse    0.23    1.000
2 wordspace: square, 50% sparse    2.15    9.348
4 wordspace: square, 90% sparse    0.40    1.739

Note that wordspace::dist.matrix has more edge-case checks than qlcMatrix::cosSparse and also permits parallelization through OpenMP. In addition, wordspace::dist.matrix supports Euclidean and Jaccard distance measures, although these are far slower. There are a number of other handy features built into that package.

That said, if you only need cosine similarity, and your matrix is >50% sparse, and you're computing the tall way, cosSparse should be the tool of choice.


answered Nov 16 '22 01:11

zdebruine

You want wordspace::dist.matrix.

It accepts sparse matrices from the Matrix package (which isn’t clear from the documentation) and can also do cross distances, output both Matrix and dist objects and more.

The default distance measure is 'cosine' though, so be sure to specify method = 'euclidean' if you want that.
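
If installing another package is not an option, Euclidean distances between rows can also be sketched with the Matrix package alone, using the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b. This is not how wordspace does it internally (I have not checked). The helper name `sparse_euclid` is made up, and a small random matrix stands in for the asker's `dataCluster`.

```r
library(Matrix)

# Hypothetical helper (not part of wordspace): Euclidean distances between the
# rows of a sparse matrix via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
sparse_euclid <- function(x) {
  sq <- rowSums(x^2)                 # squared norm of every row
  g  <- as.matrix(tcrossprod(x))     # dense n x n matrix of row dot products
  d2 <- outer(sq, sq, "+") - 2 * g
  d2[d2 < 0] <- 0                    # clamp tiny negatives caused by rounding
  as.dist(sqrt(d2))
}

set.seed(42)
# Small stand-in for the 7000 x 10000 data in the question
dataCluster <- rsparsematrix(200, 1000, density = 0.01)

d  <- sparse_euclid(dataCluster)
h1 <- hclust(d, method = "complete")

# Agreement with stats::dist on the dense version of the same data
max_err <- max(abs(as.vector(d) - as.vector(dist(as.matrix(dataCluster)))))
```

Only the input stays sparse: the distance matrix itself is dense, so for 7000 rows the intermediate 7000 x 7000 matrix of doubles takes roughly 390 MB before it is reduced to a `dist` object.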

answered Nov 16 '22 00:11

flying sheep
