Their goals are all the same: to find similar vectors. Which do you use in which situation? (any practical examples?)
The two quantities represent two different physical entities. The cosine similarity computes the similarity between two samples, whereas the Pearson correlation coefficient computes the correlation between two jointly distributed random variables.
The Euclidean distance corresponds to the L2-norm of a difference between vectors. The cosine similarity is proportional to the dot product of two vectors and inversely proportional to the product of their magnitudes.
However, when the vectors are normalized to unit length, cosine similarity is a bijective (monotonic) function of Euclidean distance — for unit vectors, d² = 2(1 − cos) — so neither carries more information than the other theoretically; in practice, cosine similarity is then faster to compute.
Whereas Euclidean distance is built from the sum of squared differences, correlation is essentially the average product of standardized scores. There is a further relationship between the two. If we expand the formula for squared Euclidean distance, we get:

d²(x, y) = Σ(xᵢ − yᵢ)² = Σxᵢ² + Σyᵢ² − 2Σxᵢyᵢ

But if X and Y are standardized (mean 0, variance 1), the sums Σxᵢ² and Σyᵢ² are both equal to n, and Σxᵢyᵢ = n·r, so d²(x, y) = 2n(1 − r), where r is the Pearson correlation.
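This identity between squared Euclidean distance and correlation for standardized vectors can be checked numerically (the random data here is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = rng.normal(size=100)

# Standardize both vectors: mean 0, variance 1 (population std)
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

n = len(xs)
d2 = np.sum((xs - ys) ** 2)    # squared Euclidean distance
r = np.corrcoef(xs, ys)[0, 1]  # Pearson correlation

# For standardized vectors: d^2 = 2n(1 - r)
assert np.isclose(d2, 2 * n * (1 - r))
```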
Pearson correlation and cosine similarity are both invariant to scaling, i.e. multiplying all elements by a positive constant. Pearson correlation is additionally invariant to adding any constant to all elements. For example, if you have two vectors X1 and X2, and your Pearson correlation function is called pearson(), then pearson(X1, X2) == pearson(X1, 2 * X2 + 3). This is an important property: you often don't care whether two vectors are similar in absolute terms, only whether they vary in the same way.
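The invariance claims above can be demonstrated directly; pearson() and cosine() here are small helper functions written for this sketch, not library APIs:

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation via NumPy's correlation matrix
    return np.corrcoef(a, b)[0, 1]

def cosine(a, b):
    # Cosine similarity: dot product over product of norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

X1 = np.array([1.0, 2.0, 3.0, 4.0])
X2 = np.array([2.0, 5.0, 6.0, 9.0])

# Pearson is invariant to positive scaling AND shifting
assert np.isclose(pearson(X1, X2), pearson(X1, 2 * X2 + 3))

# Cosine similarity is invariant to positive scaling only;
# adding a constant changes it
assert np.isclose(cosine(X1, X2), cosine(X1, 2 * X2))
assert not np.isclose(cosine(X1, X2), cosine(X1, X2 + 3))
```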
The difference between the Pearson correlation coefficient and cosine similarity can be seen from their formulas:

cosine(X, Y) = Σxᵢyᵢ / ( √(Σxᵢ²) · √(Σyᵢ²) )

pearson(X, Y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / ( √(Σ(xᵢ − x̄)²) · √(Σ(yᵢ − ȳ)²) )
The reason the Pearson correlation coefficient is invariant to adding any constant is that the means are subtracted out by construction. It is also easy to see that the Pearson correlation coefficient and cosine similarity are equivalent when X and Y have means of 0, so we can think of the Pearson correlation coefficient as a demeaned version of cosine similarity.
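The "demeaned cosine similarity" view can be verified numerically; cosine() here is a helper defined for this sketch:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([0.1, 0.2, 0.1, -0.1, 0.5])
y = 2 * x + 3

# Pearson correlation equals the cosine similarity of the demeaned vectors
r = np.corrcoef(x, y)[0, 1]
assert np.isclose(r, cosine(x - x.mean(), y - y.mean()))
```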
For practical usage, let's consider the returns of two assets x and y:
In [275]: import numpy as np

In [276]: x = np.array([0.1, 0.2, 0.1, -0.1, 0.5])

In [277]: y = x + 0.1
These assets' returns move together perfectly, which is captured by the Pearson correlation coefficient (1), but the return vectors themselves are not identical, which is reflected in the cosine similarity (≈0.971).
In [281]: np.corrcoef([x, y])
Out[281]:
array([[ 1.,  1.],   # the off-diagonal entries are the
       [ 1.,  1.]])  # correlation between x and y

In [282]: from sklearn.metrics.pairwise import cosine_similarity

In [283]: cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
Out[283]: array([[ 0.97128586]])