
What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

A = [0 1 0 0 1
     0 0 1 1 1
     1 1 0 1 0]

The sparse representation is:

A = (0, 1), (0, 4),
    (1, 2), (1, 3), (1, 4),
    (2, 0), (2, 1), (2, 3)
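A coordinate listing like this maps directly onto a scipy.sparse COO matrix. A minimal sketch, assuming the pairs are (row, col) indices of the nonzero entries and every nonzero value is 1:

```python
import numpy as np
from scipy import sparse

# (row, col) pairs from the listing above; all nonzero values are 1
rows = [0, 0, 1, 1, 1, 2, 2, 2]
cols = [1, 4, 2, 3, 4, 0, 1, 3]
data = np.ones(len(rows))

# build in COO format, then convert to CSR for fast row operations
A = sparse.coo_matrix((data, (rows, cols)), shape=(3, 5)).tocsr()
print(A.toarray())
```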

In Python, it's straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])

dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out

Gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

asked Jul 13 '13 by zbinsd




2 Answers

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output a sparse matrix
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
 [[ 1.          0.40824829  0.40824829]
  [ 0.40824829  1.          0.33333333]
  [ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
 (0, 1)  0.408248290464
 (0, 2)  0.408248290464
 (0, 0)  1.0
 (1, 0)  0.408248290464
 (1, 2)  0.333333333333
 (1, 1)  1.0
 (2, 1)  0.333333333333
 (2, 0)  0.408248290464
 (2, 2)  1.0

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

A_sparse.transpose() 
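For example, feeding the transposed matrix to cosine_similarity gives a 5x5 matrix of column-against-column similarities (a quick sketch using the same example matrix):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]]))

# rows of A.T are the columns of A, so this compares columns
col_sims = cosine_similarity(A.transpose())
print(col_sims.shape)
```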
answered Sep 30 '22 by Jeff


The following method is about 30 times faster than scipy.spatial.distance.pdist. It works quickly even on large matrices, assuming you have enough RAM.

See below for a discussion of how to optimize for sparsity.

import numpy as np

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for a sparse representation
similarity = np.dot(A, A.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = np.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[np.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = np.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
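A sparse version of the recipe above can be sketched with those scipy.sparse equivalents. This is a rough translation under the assumption that the similarity matrix itself fits in memory once densified at the end:

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))

# all pairwise dot products, kept sparse
similarity = A.dot(A.T)

# squared row magnitudes sit on the diagonal
square_mag = similarity.diagonal()

# inverse magnitudes, with zero rows mapped to zero instead of inf
inv_mag = np.zeros_like(square_mag)
nonzero = square_mag > 0
inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])

# scale columns, then rows, by the inverse magnitudes
cosine = similarity.multiply(inv_mag).multiply(inv_mag[:, np.newaxis])
print(cosine.toarray())
```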

answered Sep 30 '22 by Waylon Flinn