
cosine similarity built-in function in matlab

I want to calculate the cosine similarity between different rows of a matrix in MATLAB. I wrote the following code:

for i = 1:n_row
    for j = i:n_row
        S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
        S2(j,i) = S2(i,j);
    end
end

Matrix S1 is 11000*11000 and the code takes a very long time to run. Is there a function in MATLAB that calculates the cosine similarity between matrix rows faster than the code above?

Mehdi asked Jan 04 '18 18:01




2 Answers

Short version by calculating the similarity with pdist:

S2 = squareform(1-pdist(S1,'cosine')) + eye(size(S1,1));

Explanation:

pdist(S1,'cosine') calculates the cosine distance between all pairs of rows in S1. The similarity between all pairs is therefore 1 - pdist(S1,'cosine').

We can turn that into a square matrix where element (i,j) corresponds to the similarity between rows i and j with squareform(1-pdist(S1,'cosine')).

Finally, we have to set the main diagonal to 1, because the similarity of a row with itself is 1 but pdist does not calculate it explicitly.
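
As a quick sanity check, the pdist one-liner can be compared with a direct dot-product computation on a small matrix. This is only a minimal sketch: the matrix A and the variable names are made up for illustration, and pdist and squareform require the Statistics and Machine Learning Toolbox.

```matlab
% Small made-up example matrix (3 rows, none of them all zeros)
A = [1 0 0;
     1 1 0;
     0 2 2];

% Similarity via pdist, following the one-liner above
S_pdist = squareform(1 - pdist(A, 'cosine')) + eye(size(A, 1));

% Direct computation for comparison
n = sqrt(sum(A.^2, 2));              % Euclidean norm of each row
S_direct = (A * A.') ./ (n * n.');   % dot products divided by norm products

% Largest absolute difference; should be on the order of rounding error
max_diff = max(abs(S_pdist(:) - S_direct(:)));
```

If max_diff is tiny, both approaches agree and the choice between them comes down to speed and toolbox availability.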

Leander Moesinger answered Oct 15 '22 19:10


Your code loops over all rows, and for each row loops over (about) half the rows, computing the dot product for each unique combination of rows:

n_row = size(S1,1);
norm_r = sqrt(sum(abs(S1).^2,2)); % row-wise 2-norms; same as vecnorm(S1,2,2) in MATLAB R2017b+ or norm(S1,2,'rows') in Octave
S2 = zeros(n_row,n_row);
for i = 1:n_row
  for j = i:n_row
    S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
    S2(j,i) = S2(i,j);
  end
end

(I've taken the liberty of completing your code so it actually runs. Note the initialization of S2 before the loop: preallocating the matrix saves a lot of time!)

If you note that the dot product is a matrix product of a row vector with a column vector, you can see that the above, without the normalization step, is identical to

S2 = S1 * S1.';

This runs much faster than the explicit loop, even if it is (maybe?) not able to use the symmetry. The normalization is simply dividing each row by norm_r and each column by norm_r. Here I multiply the two vectors to produce a square matrix to normalize with:

S2 = (S1 * S1.') ./ (norm_r * norm_r.');
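
An equivalent variant, sketched here as an assumption rather than as part of the answer above, scales the rows to unit length first and then takes a single matrix product. It avoids building the second n-by-n matrix norm_r * norm_r.', which may matter at 11000*11000, where each such double matrix takes roughly 1 GB. The small S1 below is a made-up stand-in for the real data, and S1n, S2_ref and max_diff are illustrative names.

```matlab
% Made-up stand-in for the real data; the question's S1 is 11000-by-11000
S1 = [1 2 3;
      4 5 6;
      7 8 10];

norm_r = sqrt(sum(abs(S1).^2, 2));   % Euclidean norm of each row

% Scale every row to unit length. bsxfun is used so that this also runs on
% releases before R2016b, which lack implicit expansion.
S1n = bsxfun(@rdivide, S1, norm_r);

% Cosine similarity is now a single matrix product
S2 = S1n * S1n.';

% Reference result from the formula above, for comparison
S2_ref = (S1 * S1.') ./ (norm_r * norm_r.');
max_diff = max(abs(S2(:) - S2_ref(:)));
```

In either formulation, a row of all zeros has norm 0 and produces NaN entries, so such rows may need to be handled separately.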

Cris Luengo answered Oct 15 '22 20:10