Euclidean distance vs Pearson correlation vs cosine similarity?

Their goals are all the same: to find similar vectors. Which do you use in which situation? (any practical examples?)

TIMEX asked Dec 03 '09 09:12



2 Answers

Pearson correlation and cosine similarity are both invariant to scaling, i.e. multiplying all elements by a positive constant (a negative constant flips the sign of the result). Pearson correlation is also invariant to adding any constant to all elements. For example, if you have two vectors X1 and X2, and your Pearson correlation function is called pearson(), then pearson(X1, X2) == pearson(X1, 2 * X2 + 3). This is an important property because you often don't care whether two vectors are similar in absolute terms, only whether they vary in the same way.
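A minimal sketch of these invariance properties, assuming NumPy is available. The helper names pearson() and cosine() and the sample vectors are illustrative choices, not part of the original answer.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

def cosine(a, b):
    """Cosine similarity of two 1-D arrays."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Arbitrary illustrative data (an assumption, not from the answer).
X1 = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
X2 = np.array([2.0, 1.0, 5.0, 3.0, 6.0])

# Pearson ignores both a positive rescaling and a constant shift.
print(np.isclose(pearson(X1, X2), pearson(X1, 2 * X2 + 3)))

# Cosine similarity ignores a positive rescaling...
print(np.isclose(cosine(X1, X2), cosine(X1, 2 * X2)))

# ...but a constant shift changes the angle between the vectors.
print(np.isclose(cosine(X1, X2), cosine(X1, X2 + 3)))

# A negative factor flips the sign of the correlation.
print(np.isclose(pearson(X1, -X2), -pearson(X1, X2)))
```

Only the third comparison should come out false: shifting X2 changes its direction relative to X1, which cosine similarity is sensitive to and Pearson correlation is not.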

dsimcha answered Sep 18 '22 04:09


The difference between Pearson Correlation Coefficient and Cosine Similarity can be seen from their formulas:

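Pearson Correlation Coefficient:

$$r(X, Y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$

Cosine Similarity:

$$\cos(X, Y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$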

The reason Pearson Correlation Coefficient is invariant to adding any constant is that the means are subtracted out by construction. It is also easy to see that Pearson Correlation Coefficient and Cosine Similarity are equivalent when X and Y both have a mean of 0, so we can think of Pearson Correlation Coefficient as a demeaned version of Cosine Similarity.
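The "demeaned cosine" view can be checked numerically. This is a small sketch assuming NumPy; the cosine() helper and the sample vectors are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D arrays."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Arbitrary illustrative data (an assumption, not from the answer).
X = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
Y = np.array([2.0, 7.0, 1.0, 8.0, 2.0])

pearson_xy = np.corrcoef(X, Y)[0, 1]

# Subtract each vector's mean, then take the ordinary cosine similarity.
centered_cosine = cosine(X - X.mean(), Y - Y.mean())

print(np.isclose(pearson_xy, centered_cosine))  # the two should agree
print(np.isclose(pearson_xy, cosine(X, Y)))     # the raw cosine generally differs
```

For vectors whose means are far from 0, the raw cosine and the Pearson coefficient can even disagree in sign, as they do for this X and Y.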

For a practical example, consider the returns of two assets, x and y:

    import numpy as np

    # Returns of the first asset
    x = np.array([0.1, 0.2, 0.1, -0.1, 0.5])
    # The second asset's returns are the same, shifted up by a constant
    y = x + 0.1


These assets' returns move together exactly, which is what the Pearson Correlation Coefficient measures (1), but the two return vectors do not point in exactly the same direction, which is what cosine similarity measures (0.971).

    np.corrcoef([x, y])
    # array([[ 1.,  1.],   the off-diagonal entries are the
    #        [ 1.,  1.]])  correlation between x and y

    from sklearn.metrics.pairwise import cosine_similarity

    # scikit-learn expects 2-D inputs, so reshape each vector into a single row
    cosine_similarity(x.reshape(1, -1), y.reshape(1, -1))
    # array([[ 0.97128586]])

Akavall answered Sep 20 '22 04:09