I want to calculate the cosine similarity between different rows of a matrix in MATLAB. I wrote the following code:
for i = 1:n_row
    for j = i:n_row
        S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
        S2(j,i) = S2(i,j);
    end
end
The matrix S1 is 11000 x 11000 and the code takes a very long time to run. Is there a function in MATLAB that calculates the cosine similarity between matrix rows faster than the code above?
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
Y = cos(X) returns the cosine for each element of X. The cos function operates element-wise on arrays. The function accepts both real and complex inputs. For real values of X, cos(X) returns real values in the interval [-1, 1].
Tf-idf is a transformation you apply to texts to get two real-valued vectors. You can then obtain the cosine similarity of any pair of vectors by taking their dot product and dividing that by the product of their norms. That yields the cosine of the angle between the vectors: $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, where a and b are the two vectors and θ is the angle between them.
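As a minimal sketch of that definition for a single pair of vectors (the variables a, b, c and d and their values are made up for illustration):

```matlab
% Two example vectors (hypothetical values, for illustration only)
a = [1 2 3];
b = [2 4 6];      % b is a scaled copy of a, so the angle between them is 0

% Cosine similarity: dot product divided by the product of the norms
cos_ab = dot(a, b) / (norm(a) * norm(b));

% An orthogonal pair has cosine similarity 0
c = [1 0];
d = [0 1];
cos_cd = dot(c, d) / (norm(c) * norm(d));
```

Here cos_ab comes out as 1 up to floating-point rounding, since the two vectors point in the same direction, and cos_cd is 0.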
Short version by calculating the similarity with pdist:
S2 = squareform(1-pdist(S1,'cosine')) + eye(size(S1,1));
pdist(S1,'cosine') calculates the cosine distance between all combinations of rows in S1. Therefore the similarity between all combinations is 1 - pdist(S1,'cosine').
We can turn that into a square matrix where element (i,j) corresponds to the similarity between rows i and j with squareform(1-pdist(S1,'cosine')).
Finally we have to set the main diagonal to 1, because the similarity of a row with itself is obviously 1 but that is not explicitly calculated by pdist.
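To sanity-check the pdist one-liner, here is a small sketch on a made-up 3-row matrix (the data and the variable s12 are assumptions for illustration). Note that pdist and squareform are part of the Statistics and Machine Learning Toolbox, so this only runs where that toolbox is installed.

```matlab
% Small example matrix (hypothetical data); each row is one vector
S1 = [1 0 0;
      1 1 0;
      0 2 2];

% Similarity via pdist: 1 - cosine distance, reshaped to a square matrix,
% with the main diagonal set to 1 by adding the identity
S2 = squareform(1 - pdist(S1, 'cosine')) + eye(size(S1, 1));

% Direct check of one entry against the definition
s12 = dot(S1(1,:), S1(2,:)) / (norm(S1(1,:)) * norm(S1(2,:)));
```

S2(1,2) should agree with s12 up to rounding, and S2 is symmetric with ones on the diagonal.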
Your code loops over all rows, and for each row loops over (about) half the rows, computing the dot product for each unique combination of rows:
n_row = size(S1,1);
norm_r = sqrt(sum(abs(S1).^2,2)); % row-wise 2-norms; same as vecnorm(S1,2,2) in MATLAB R2017b+ or norm(S1,2,'rows') in Octave
S2 = zeros(n_row,n_row);
for i = 1:n_row
    for j = i:n_row
        S2(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
        S2(j,i) = S2(i,j);
    end
end
(I've taken the liberty to complete your code so it actually runs. Note the initialization of S2 before the loop; this saves a lot of time!)
If you note that the dot product is a matrix product of a row vector with a column vector, you can see that the above, without the normalization step, is identical to
S2 = S1 * S1.';
This runs much faster than the explicit loop, even if it is (maybe?) not able to use the symmetry. The normalization simply divides each row i by norm_r(i) and each column j by norm_r(j). Here I take the outer product of norm_r with itself to produce a square matrix to normalize with:
S2 = (S1 * S1.') ./ (norm_r * norm_r.');
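A quick way to convince yourself that the vectorized form matches the loop is to compare both on a small random matrix. The sketch below assumes S1 is real-valued; the matrix size, the seed, and the names S2_loop, S2_vec, S2_alt and S1n are made up for illustration. It also shows a variant that normalizes the rows first and then takes a single matrix product, which should give the same result.

```matlab
% Small random test matrix (hypothetical size; the question uses 11000 rows)
rng(0);                                   % fixed seed so the run is repeatable
S1 = rand(50, 20);
n_row = size(S1, 1);
norm_r = sqrt(sum(abs(S1).^2, 2));        % row-wise 2-norms

% Reference result from the explicit double loop
S2_loop = zeros(n_row, n_row);
for i = 1:n_row
    for j = i:n_row
        S2_loop(i,j) = dot(S1(i,:), S1(j,:)) / (norm_r(i) * norm_r(j));
        S2_loop(j,i) = S2_loop(i,j);
    end
end

% Vectorized version: matrix product divided by the outer product of norms
S2_vec = (S1 * S1.') ./ (norm_r * norm_r.');

% Variant: scale each row to unit length first, then one matrix product
S1n = bsxfun(@rdivide, S1, norm_r);
S2_alt = S1n * S1n.';

% Largest absolute deviation from the loop result
max_diff_vec = max(abs(S2_loop(:) - S2_vec(:)));
max_diff_alt = max(abs(S2_loop(:) - S2_alt(:)));
```

Both differences should be at the level of floating-point rounding. In all three versions a row of zeros has norm 0, so its similarities come out as NaN (0/0); if S1 can contain such rows, they need separate handling.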