I'm trying to compute the distance correlation between columns; see the code below. Most of the time it returns a result greater than 1, which should not be possible, because distance correlation lies between 0 and 1. You can read about scipy's distance correlation here.
import numpy as np
from scipy.spatial import distance
x = np.random.uniform(-1, 1, 10000)
print(distance.correlation(x, x**2))
1.00210811815
What is wrong here or how can I measure it?
Update 1: link to the issue on GitHub.
The distance covariance between random vectors X and Y has the following properties: X and Y are independent if and only if dCov(X,Y) = 0. You can define the distance variance dVar(X) = dCov(X,X) and the distance correlation as dCor(X,Y) = dCov(X,Y) / sqrt( dVar(X) dVar(Y) ) when both variances are positive.
Correlation distance is a popular way of measuring the distance between two random variables with finite variances. If the correlation between two random variables is r, then their correlation distance is defined as d = 1 - r.
Thus, distance correlation measures both linear and nonlinear association between two random variables or random vectors. This is in contrast to Pearson's correlation, which can only detect linear association between two random variables.
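To make the difference from Pearson's correlation concrete, here is a minimal sketch of the sample distance correlation built from the dCov/dVar definitions above. The function name `distance_correlation` is my own and is not a scipy function; the sketch assumes 1-D samples and uses the simple O(n^2) double-centering estimator, so keep n small.

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation of two 1-D samples (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise absolute-difference (distance) matrices
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # Double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0, keepdims=True) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0, keepdims=True) - b.mean(axis=1, keepdims=True) + b.mean()
    dcov2 = (A * B).mean()    # squared distance covariance
    dvar2_x = (A * A).mean()  # squared distance variance of x
    dvar2_y = (B * B).mean()  # squared distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0            # dCor is only defined when both variances are positive
    return float(np.sqrt(max(dcov2, 0.0) / denom))

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 500)

dcor_self = distance_correlation(x, x)        # identical samples
dcor_square = distance_correlation(x, x**2)   # nonlinear dependence
pearson_square = np.corrcoef(x, x**2)[0, 1]   # linear correlation only

print(dcor_self, dcor_square, pearson_square)
```

The result always falls in [0, 1]. For x and x**2 the Pearson correlation should be close to 0 while the distance correlation stays clearly positive, which is the nonlinear association described above.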
scipy.spatial.distance.cdist(XA, XB, metric='euclidean') computes the distance between each pair of rows taken from two collections of inputs. Parameters: XA and XB are 2-D arrays whose rows are the observations to compare; metric names the distance function to use (for example 'correlation').
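As a small illustration (the array shapes and random data here are made up), cdist with metric='correlation' gives the same numbers as calling distance.correlation on each pair of rows:

```python
import numpy as np
from scipy.spatial.distance import cdist, correlation

rng = np.random.default_rng(0)
XA = rng.normal(size=(3, 50))  # 3 observations, 50 values each
XB = rng.normal(size=(4, 50))  # 4 observations, 50 values each

# D[i, j] is the correlation distance between row i of XA and row j of XB
D = cdist(XA, XB, metric="correlation")
print(D.shape)

# Spot-check one entry against the pairwise function
single = correlation(XA[0], XB[1])
print(D[0, 1], single)
```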
Correlation distance is the complement of correlation (not its inverse) and only looks at the angle, i.e. the similarity, among patterns (a kind of normalization). It ranges from 0 to 2, with 0 being perfect correlation, 1 being no correlation, and 2 being perfect anticorrelation. So a small correlation distance means the patterns are close together in correlational space (small angular difference). corr = 1 - dist and dist = 1 - corr, so a high correlation means a strong relationship, and a low correlation distance also means a strong relationship.
I don't see why this is a problem according to the documentation.
From the documentation:
The correlation distance between u and v is defined as 1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\lVert u - \bar{u} \rVert_2 \, \lVert v - \bar{v} \rVert_2}
By the Cauchy-Schwarz Inequality, the expression following the minus sign has an absolute value that is at most 1. There is nothing stipulating that it won't be negative, though - in fact, this will happen if the (mean normalized) vectors are anticorrelated.
AFAICT, you should only be surprised if you get a value larger than 2 or smaller than 0. Using the comment by @Cleb and the fact that the range is [0, 2], I'm guessing that some other packages simply define the distance as half this expression.
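A quick check of the endpoints, as a sketch reusing the question's setup (the seed and variable names are mine): identical vectors give about 0, anticorrelated vectors give about 2, and x versus x**2 gives 1 - r, which lands a little above 1 whenever the sample correlation r happens to be slightly negative.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10000)

d_same = distance.correlation(x, x)     # r = 1, so distance is about 0
d_anti = distance.correlation(x, -x)    # r = -1, so distance is about 2
d_sq = distance.correlation(x, x**2)    # r near 0, so distance is near 1

# Pearson correlation of the same pair, for comparison with 1 - r
r = np.corrcoef(x, x**2)[0, 1]

print(d_same, d_anti, d_sq, 1 - r)
```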