
What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:

A = [0 1 0 0 1
     0 0 1 1 1
     1 1 0 1 0]

The sparse representation is:

A = (0, 1), (0, 4),
    (1, 2), (1, 3), (1, 4),
    (2, 0), (2, 1), (2, 3)
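A coordinate listing like this maps directly onto a scipy.sparse COO matrix. A minimal sketch, assuming the pairs are (row, col) indices of the nonzero entries and every nonzero value is 1:

```python
import numpy as np
from scipy import sparse

# (row, col) pairs from the listing above; all nonzero values are 1
rows = [0, 0, 1, 1, 1, 2, 2, 2]
cols = [1, 4, 2, 3, 4, 0, 1, 3]
data = np.ones(len(rows))

# build in COO format, then convert to CSR for fast row operations
A = sparse.coo_matrix((data, (rows, cols)), shape=(3, 5)).tocsr()
print(A.toarray())
```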

In Python, it's straightforward to work with the matrix-input format:

import numpy as np
from sklearn.metrics import pairwise_distances
from scipy.spatial.distance import cosine

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])

dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out

Gives:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

That's fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished? Thanks in advance.

asked Jul 13 '13 by zbinsd




2 Answers

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# can also output a sparse matrix
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

pairwise dense output:
 [[ 1.          0.40824829  0.40824829]
  [ 0.40824829  1.          0.33333333]
  [ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
 (0, 1)  0.408248290464
 (0, 2)  0.408248290464
 (0, 0)  1.0
 (1, 0)  0.408248290464
 (1, 2)  0.333333333333
 (1, 1)  1.0
 (2, 1)  0.333333333333
 (2, 0)  0.408248290464
 (2, 2)  1.0

If you want column-wise cosine similarities simply transpose your input matrix beforehand:

A_sparse.transpose() 
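For example, feeding the transposed matrix to cosine_similarity gives a 5x5 matrix of column-against-column similarities (a quick sketch using the same example matrix):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]]))

# rows of A.T are the columns of A, so this compares columns
col_sims = cosine_similarity(A.transpose())
print(col_sims.shape)
```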
answered Sep 30 '22 by Jeff


The following method is about 30 times faster than scipy.spatial.distance.pdist. It works quickly even on large matrices, assuming you have enough RAM.

See below for a discussion of how to optimize for sparsity.

import numpy as np

# base similarity matrix (all dot products)
# replace this with A.dot(A.T).toarray() for a sparse representation
similarity = np.dot(A, A.T)

# squared magnitude of preference vectors (number of occurrences)
square_mag = np.diag(similarity)

# inverse squared magnitude
inv_square_mag = 1 / square_mag

# if it doesn't occur, set its inverse magnitude to zero (instead of inf)
inv_square_mag[np.isinf(inv_square_mag)] = 0

# inverse of the magnitude
inv_mag = np.sqrt(inv_square_mag)

# cosine similarity (elementwise multiply by inverse magnitudes)
cosine = similarity * inv_mag
cosine = cosine.T * inv_mag

If your problem is typical for large scale binary preference problems, you have a lot more entries in one dimension than the other. Also, the short dimension is the one whose entries you want to calculate similarities between. Let's call this dimension the 'item' dimension.

If this is the case, list your 'items' in rows and create A using scipy.sparse. Then replace the first line as indicated.

If your problem is atypical you'll need more modifications. Those should be pretty straightforward replacements of basic numpy operations with their scipy.sparse equivalents.
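A sparse version of the recipe above can be sketched with those scipy.sparse equivalents. This is a rough translation under the assumption that the similarity matrix itself fits in memory once densified at the end:

```python
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                [0, 0, 1, 1, 1],
                                [1, 1, 0, 1, 0]], dtype=float))

# all pairwise dot products, kept sparse
similarity = A.dot(A.T)

# squared row magnitudes sit on the diagonal
square_mag = similarity.diagonal()

# inverse magnitudes, with zero rows mapped to zero instead of inf
inv_mag = np.zeros_like(square_mag)
nonzero = square_mag > 0
inv_mag[nonzero] = 1.0 / np.sqrt(square_mag[nonzero])

# scale columns, then rows, by the inverse magnitudes
cosine = similarity.multiply(inv_mag).multiply(inv_mag[:, np.newaxis])
print(cosine.toarray())
```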

answered Sep 30 '22 by Waylon Flinn