Scipy: distance correlation is higher than 1

I'm trying to compute the distance correlation between columns (see the code below). Most of the time it returns a result greater than 1, which should not be possible, because distance correlation lies between 0 and 1. You can read about SciPy's distance correlation here.

import numpy as np
from scipy.spatial import distance

x = np.random.uniform(-1, 1, 10000)
print(distance.correlation(x, x**2))

1.00210811815

What is wrong here or how can I measure it?

upd1: Link to issue on github

Rocketq asked Mar 14 '16 13:03



2 Answers

Correlation distance is the complement of correlation and only looks at the angle/similarity among patterns (sort of like normalization). It ranges from 0 to 2, with 0 being perfect correlation, 1 being no correlation, and 2 being perfect anticorrelation. So a small correlation distance means the patterns are close together in correlation space (small angular difference). corr = 1 - dist and dist = 1 - corr, so a high correlation means a strong relationship, and a low correlation distance means a strong relationship as well.
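The relation dist = 1 - corr can be checked numerically. The following is a minimal sketch, assuming NumPy's `np.corrcoef` as the Pearson correlation and an arbitrarily chosen seed; it reuses the `x` and `x**2` setup from the question.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility
x = rng.uniform(-1, 1, 10000)
y = x ** 2

r = np.corrcoef(x, y)[0, 1]     # Pearson correlation coefficient
d = distance.correlation(x, y)  # SciPy's correlation distance

# d should match 1 - r up to floating-point error
print(r, d, 1 - r)
```

Because `x` is symmetric around 0, the Pearson correlation between `x` and `x**2` is close to 0, so the distance lands close to 1 and can fall slightly above or below it.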

b1234 answered Oct 22 '22 20:10


I don't see why this is a problem according to the documentation.

From the documentation:

The correlation distance between u and v is defined as

1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\|u - \bar{u}\|_2 \, \|v - \bar{v}\|_2}

By the Cauchy-Schwarz inequality, the expression following the minus sign has an absolute value of at most 1. There is nothing stipulating that it won't be negative, though - in fact, this will happen if the (mean-centered) vectors are anticorrelated.
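A small sketch of the two extremes, using made-up vectors that are not from the question: a positive linear transform of `u` should give a distance of about 0, and the negated vector should give a distance of about 2.

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative data, chosen arbitrarily

d_same = distance.correlation(u, 2 * u + 1)  # perfectly correlated
d_anti = distance.correlation(u, -u)         # perfectly anticorrelated

print(d_same, d_anti)
```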

AFAICT, you should only be surprised if you got a value larger than 2 or smaller than 0. Using the comment by @Cleb and the fact that the range is [0, 2], I'm guessing that some other packages simply define the distance as half this expression.
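If what is wanted is the statistic that does lie in [0, 1] (the distance correlation defined from distance covariance, which is a different quantity from the correlation distance above), it can be computed by hand. The following is a rough sketch, not a SciPy API: the helper names `_double_centered` and `distance_correlation` are hypothetical, the estimator is the simple biased sample version, and memory use grows as n squared, so the sample size here is kept small.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def _double_centered(a):
    """Pairwise distance matrix with row, column, and grand means removed."""
    d = squareform(pdist(a.reshape(len(a), -1)))
    return (d - d.mean(axis=0, keepdims=True)
              - d.mean(axis=1, keepdims=True) + d.mean())


def distance_correlation(x, y):
    """Biased sample distance correlation (hypothetical helper, O(n^2) memory)."""
    a = _double_centered(np.asarray(x, dtype=float))
    b = _double_centered(np.asarray(y, dtype=float))
    dcov2 = max((a * b).mean(), 0.0)  # squared distance covariance
    dvar2_x = (a * a).mean()          # squared distance variance of x
    dvar2_y = (b * b).mean()          # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    return float(np.sqrt(dcov2 / denom)) if denom > 0 else 0.0


rng = np.random.default_rng(0)        # arbitrary seed
x = rng.uniform(-1, 1, 1000)          # smaller n than in the question
dcor_xy = distance_correlation(x, x ** 2)
dcor_xx = distance_correlation(x, x)
print(dcor_xy, dcor_xx)
```

Unlike the Pearson-based value, this one should come out clearly above 0 for `x` and `x**2`, since the two are dependent even though they are linearly uncorrelated, and it equals 1 when a sample is compared with itself.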

Ami Tavory answered Oct 22 '22 21:10