 

cosine similarity optimized implementation


I am trying to understand this optimized code that computes the cosine similarity between the columns of a users/items ratings matrix.

import numpy as np

def fast_similarity(ratings, epsilon=1e-9):
    # epsilon -> small number for handling divide-by-zero errors
    sim = ratings.T.dot(ratings) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return sim / norms / norms.T

If ratings =

              items
    users [ [1,2,3]
            [4,5,6]
            [7,8,9] ]

norms will be equal to [1^2 + 5^2 + 9^2]

but why do we write sim / norms / norms.T to calculate the cosine similarity? Any help is appreciated.
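For reference, here is the snippet made runnable on the sample matrix from the question (a quick sketch assuming NumPy; the rounding is only for display):

```python
import numpy as np

def fast_similarity(ratings, epsilon=1e-9):
    # epsilon -> small number for handling divide-by-zero errors
    sim = ratings.T.dot(ratings) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return sim / norms / norms.T

ratings = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]], dtype=float)

similarity = fast_similarity(ratings)
# the result is symmetric and every column's similarity
# with itself is (up to epsilon) equal to 1
print(np.round(similarity, 4))
```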

Manish Kumar asked Mar 29 '17

People also ask

Why use cosine similarity instead of Euclidean distance?

The cosine similarity is advantageous because even if two similar documents are far apart by the Euclidean distance because of their size (for example, the word 'cricket' appears 50 times in one document and 10 times in another), they could still have a small angle between them. The smaller the angle, the higher the similarity.

What is a good cosine similarity?

The higher the similarity, the lower the distance. When you pick a threshold on similarities for text/documents, a value higher than 0.5 usually indicates strong similarity.

Is cosine similarity a machine learning algorithm?

Cosine similarity is used as a metric in different machine learning algorithms, such as KNN, to determine the distance between neighbors. In recommendation systems it is used to recommend movies with similar characteristics, and for textual data it is used to measure the similarity of texts in a document.

How do you implement cosine similarity in Python?

We use the below formula to compute the cosine similarity:

    cos(A, B) = (A . B) / (||A|| * ||B||)

where A and B are vectors: A . B is the dot product of A and B, computed as the sum of the element-wise product of A and B; ||A|| is the L2 norm of A, computed as the square root of the sum of the squares of the elements of A.
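A minimal sketch of that formula in Python (the helper name cosine_similarity is just for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # (A . B) / (||A|| * ||B||)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors -> similarity 0
print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors   -> similarity ~1
```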


1 Answer

Going through the code we have that:

    sim = ratings.T.dot(ratings) + epsilon
    # sim[i][j] = (column i of ratings) . (column j of ratings)

And this means that on the diagonal of the sim matrix we have the result of the dot product of each column with itself.

You can give it a try if you want using a simple matrix:

    A = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

    A.T.dot(A) = [[ 66,  78,  90],
                  [ 78,  93, 108],
                  [ 90, 108, 126]]

And you can easily check that this Gram matrix (that's how this matrix product is named) has this property.
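You can verify the property with a few lines of NumPy (using the sample matrix from the question):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

gram = A.T.dot(A)
# the diagonal holds each column dotted with itself,
# i.e. the squared L2 norm of each column
print(np.diagonal(gram))      # squared column norms: 66, 93, 126
print((A * A).sum(axis=0))    # the same values computed directly
```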

Now the code defines norms, which is nothing but an array taking the diagonal of our Gram matrix and applying a sqrt to each element of it.

This will give us an array containing the norm value for each column:

    norms = [sqrt(sim[0][0]), sqrt(sim[1][1]), sqrt(sim[2][2])]

So basically the norms vector contains the norm of each column of the ratings matrix.

Once we have all this data we can evaluate the cosine similarity between those users; we know that the cosine similarity is evaluated as:

    cos(A, B) = (A . B) / (||A|| * ||B||)

Note that: ||A|| = sqrt(A . A)

So we have that our similarity is going to be:

    similarity[i][j] = (column_i . column_j) / (||column_i|| * ||column_j||)

So we just have to substitute the terms with our code variables to get:

    similarity[i][j] = sim[i][j] / (norms[i] * norms[j])

And this explains why you have this line of code:

return sim / norms / norms.T
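A small NumPy sketch of how the two broadcast divisions combine (the manual variable is just an illustrative cross-check using np.outer):

```python
import numpy as np

ratings = np.array([[1, 2, 3],
                    [4, 5, 6],
                    [7, 8, 9]], dtype=float)

sim = ratings.T.dot(ratings)
norms = np.array([np.sqrt(np.diagonal(sim))])  # shape (1, 3)

# sim / norms    divides column j by norms[j] (broadcast down the rows)
# ...  / norms.T divides row i    by norms[i] (broadcast across the columns)
cosine = sim / norms / norms.T

# equivalent explicit form: sim[i][j] / (norms[i] * norms[j])
manual = sim / np.outer(norms, norms)
print(np.allclose(cosine, manual))
```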

EDIT: Since it seems that I was not clear: every time I talk about matrix multiplication in this answer, I am referring to the DOT PRODUCT of two matrices.

This actually means that when A*B is written, we actually develop and solve it as A.T * B.
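A tiny NumPy example of the distinction (A * B is the element-wise product; the answer means the dot product):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

print(A * A)       # element-wise product: [[1, 4], [9, 16]]
print(A.T.dot(A))  # dot product meant in the answer: [[10, 14], [14, 20]]
```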

rakwaht answered Sep 23 '22