
How to know when to use a particular kind of Similarity index? Euclidean Distance vs. Pearson Correlation

What are some of the deciding factors to take into consideration when choosing a similarity index? In what cases is Euclidean distance preferred over Pearson correlation, and vice versa?

Horse Voice asked Nov 10 '12 17:11



2 Answers

It really depends on the application scenario at hand. Very briefly: if the actual difference in attribute values matters, go with Euclidean distance. If you are looking for similarity in trend or shape, go with correlation. Also note that if you z-score normalize each object, Euclidean distance behaves like the Pearson correlation coefficient: for two z-scored vectors of length n, the squared Euclidean distance equals 2n(1 - r), where r is their Pearson correlation (using the population standard deviation). Pearson is insensitive to positive linear transformations of the data, that is, shifting by a constant or scaling by a positive factor. There are other correlation coefficients, such as Spearman's, that take only the ranks of the values into account and are therefore insensitive to both linear and non-linear monotonic transformations. Note that correlation is usually turned into a dissimilarity as 1 - correlation, which does not satisfy all the requirements of a metric distance (for example, the triangle inequality can fail).
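To make the z-score remark concrete, here is a minimal sketch in plain Python. The helper names (`pearson`, `zscore`, `euclidean`) and the sample vectors are made up for illustration, and the z-score is assumed to use the population standard deviation (dividing by n). Under that assumption the squared Euclidean distance between the normalized vectors should come out as 2n(1 - r).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def zscore(x):
    """Center x and scale it by its population standard deviation."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Illustrative data only.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [3.0, 1.0, 6.0, 5.0, 9.0]
n = len(x)

r = pearson(x, y)
d_sq = euclidean(zscore(x), zscore(y)) ** 2

# The two quantities printed here should agree up to rounding.
print(d_sq, 2 * n * (1 - r))
```

So after z-scoring, ranking pairs by Euclidean distance gives the same order as ranking them by 1 - r.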

There are studies on which proximity measure to select for a particular application, for instance:

Pablo A. Jaskowiak, Ricardo J. G. B. Campello, Ivan G. Costa Filho, "Proximity Measures for Clustering Gene Expression Microarray Data: A Validation Methodology and a Comparative Analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 99, no. PrePrints, p. 1, 2013.

John D answered Oct 12 '22 23:10



Correlation is unit independent: if you multiply one of the objects by ten, you get a different Euclidean distance but the same correlation distance. Correlation-based measures are therefore a good fit when you want to measure the distance between objects such as genes defined by their expression profiles.
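A small sketch of that scaling behaviour, in plain Python. The two profiles are invented numbers, the helpers are hypothetical, and "correlation distance" is assumed here to mean 1 - r.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two made-up profiles with a similar shape.
a = [2.0, 4.0, 3.0, 8.0, 6.0]
b = [1.0, 3.0, 2.0, 5.0, 4.0]
b_scaled = [10.0 * v for v in b]  # same profile, expressed in different units

d_before = euclidean(a, b)
d_after = euclidean(a, b_scaled)

c_before = 1.0 - pearson(a, b)        # correlation distance, assumed 1 - r
c_after = 1.0 - pearson(a, b_scaled)

print("Euclidean:", d_before, "->", d_after)
print("Correlation distance:", c_before, "->", c_after)
```

The Euclidean distance changes a lot after rescaling, while the correlation distance stays the same up to rounding.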

Often, the absolute or squared correlation is used to build the distance (for example 1 - |r|), because we are more interested in the strength of the relationship than in its sign.

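However, correlation is only suitable for high-dimensional data; there is hardly any point in calculating it for two- or three-dimensional data points. With only two coordinates, for instance, the Pearson correlation between two points is always +1 or -1 (or undefined when a point has two equal coordinates).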
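To see why very low-dimensional points are a poor fit, here is a rough check in plain Python with arbitrary example points: for points with just two coordinates, the coefficient carries no information beyond a sign.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Arbitrary pairs of two-dimensional points (illustrative values only).
pairs = [
    ([1.0, 5.0], [2.0, 3.0]),
    ([0.3, 0.1], [7.0, 9.0]),
    ([4.0, -2.0], [10.0, 0.5]),
]

values = [pearson(p, q) for p, q in pairs]

# Each value is +1 or -1 up to rounding, however far apart the points are.
print(values)
```

The sign only says whether both points increase or decrease from the first coordinate to the second, so the result is not useful as a similarity measure.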

Also note that "Pearson distance" is a weighted type of Euclidean distance, and not the "correlation distance" using Pearson correlation coefficient.

January answered Oct 13 '22 01:10
