I have a pandas dataframe that looks as follows:
In [23]: dataframe.head()
Out[23]:
column_id 1 10 11 12 13 14 15 16 17 18 ... 46 47 48 49 5 50 \
row_id ...
1 NaN NaN 1 1 1 1 1 1 1 1 ... 1 1 NaN 1 NaN NaN
10 1 1 1 1 1 1 1 1 1 NaN ... 1 1 1 NaN 1 NaN
100 1 1 NaN 1 1 1 1 1 NaN 1 ... NaN NaN 1 1 1 NaN
11 NaN 1 1 1 1 1 1 1 1 NaN ... NaN 1 1 1 1 1
12 1 1 1 NaN 1 1 1 1 NaN 1 ... 1 NaN 1 1 NaN 1
The thing is, I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:
In [24]: dataframe.transpose().corr().head()
Out[24]:
row_id 1 10 100 11 12 13 14 15 16 17 ... 90 91 92 93 94 95 \
row_id ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
Is there any other way of computing correlations that avoids this? Or maybe an easy way to calculate the Euclidean distance between rows with a single method, just as the Pearson correlation has?
Thanks!
A.
One option is scipy.spatial.distance_matrix:

distance_matrix(x, y, p=2)

Parameters:
x : (M, K) matrix of M vectors, each of dimension K.
y : (N, K) matrix of N vectors, each of dimension K.
p : float, 1 <= p <= infinity, defines which Minkowski p-norm to use.

Returns: (M, N) ndarray / matrix containing the distance from every vector in x to every vector in y.

The Euclidean distance is simply the square root of the sum of squared differences between corresponding elements of the rows (or columns). This is probably the most commonly used distance metric.
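For example, here is a minimal sketch of using it on the question's frame, assuming it is called dataframe and that treating NaN as 0 is acceptable (distance_matrix does not accept NaN):

>>> import pandas as pd
>>> from scipy.spatial import distance_matrix
>>> zero_data = dataframe.fillna(0)
>>> result = pd.DataFrame(distance_matrix(zero_data.values, zero_data.values),
...                       index=zero_data.index, columns=zero_data.index)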
The key question here is what distance metric to use.
Let's say this is your data.
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 \
0 1 1 1 NaN 1 NaN NaN 1 1 1 ... 1 1 NaN 1 NaN 1 1 1
1 1 1 1 NaN 1 1 1 1 1 1 ... NaN 1 1 NaN NaN 1 1 1
2 1 1 1 1 1 1 1 1 1 1 ... 1 NaN 1 1 1 1 1 NaN
3 1 NaN 1 NaN 1 NaN 1 NaN 1 1 ... 1 1 1 1 NaN 1 1 1
4 1 1 1 1 1 1 1 1 NaN 1 ... NaN 1 1 1 1 1 1 1
You can compute a distance metric as the fraction of values that differ between each pair of columns. The result shows the proportion of positions at which any two columns differ.
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 7 8 9 ... 40 \
0 0.00 0.36 0.33 0.37 0.32 0.41 0.35 0.33 0.39 0.33 ... 0.37
1 0.36 0.00 0.37 0.29 0.30 0.37 0.33 0.37 0.33 0.31 ... 0.35
2 0.33 0.37 0.00 0.36 0.29 0.38 0.40 0.34 0.30 0.28 ... 0.28
3 0.37 0.29 0.36 0.00 0.29 0.30 0.34 0.26 0.32 0.36 ... 0.36
4 0.32 0.30 0.29 0.29 0.00 0.31 0.35 0.29 0.29 0.25 ... 0.27
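As an aside, if scipy is available, the same proportion-of-differing-values can be computed in a vectorized way with its 'hamming' metric. This is only a sketch and assumes the filled data is strictly 0/1:

>>> from scipy.spatial.distance import pdist, squareform
>>> result = pd.DataFrame(squareform(pdist(zero_data.T.values, metric='hamming')),
...                       index=zero_data.columns, columns=zero_data.columns)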
Alternatively, you can use the Pearson correlation coefficient. This is a perfectly valid metric; specifically, it translates to the phi coefficient in the case of binary data.
>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 1.000000 0.013158 0.026262 -0.059786 -0.024293 -0.078056 0.054074
1 0.013158 1.000000 -0.093109 0.170159 0.043187 0.027425 0.108148
2 0.026262 -0.093109 1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786 0.170159 -0.124540 1.000000 0.004245 0.184153 0.042524
4 -0.024293 0.043187 -0.048485 0.004245 1.000000 0.079196 -0.099834
Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
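As a side note, pandas can compute both of these pairwise column correlations directly, without the nested apply; a quick sketch using the same zero_data as above:

>>> result_pearson = zero_data.corr()                     # Pearson is the default method
>>> result_spearman = zero_data.corr(method='spearman')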
Finally, here is the Euclidean distance, which is what the question asked about:
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 0.000000 6.000000 5.744563 6.082763 5.656854 6.403124 5.916080
1 6.000000 0.000000 6.082763 5.385165 5.477226 6.082763 5.744563
2 5.744563 6.082763 0.000000 6.000000 5.385165 6.164414 6.324555
3 6.082763 5.385165 6.000000 0.000000 5.385165 5.477226 5.830952
4 5.656854 5.477226 5.385165 5.385165 0.000000 5.567764 5.916080
By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
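If you do this with several metrics, one way to package that pattern is a small helper function. This is just a sketch, and pairwise is an illustrative name rather than anything built into pandas:

def pairwise(data, metric):
    # Apply `metric` to every pair of columns and return the square result.
    return data.apply(lambda col1: data.apply(lambda col2: metric(col1, col2)))

result = pairwise(data.fillna(0), lambda c1, c2: np.linalg.norm(c1 - c2))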
A proposal to improve the excellent answer from @s-anand for Euclidean distance: instead of

zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

we can apply fillna to fill only the missing data, thus:

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

This way, the distance on missing dimensions will not be counted.
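To see the difference on a tiny example (a sketch; the values are chosen only for illustration):

import numpy as np
import pandas as pd

a = pd.Series([1, np.nan, 1])
b = pd.Series([1, 1, np.nan])

np.linalg.norm(a.fillna(0) - b.fillna(0))   # sqrt(2): each missing value counts as a full difference
np.linalg.norm((a - b).fillna(0))           # 0.0: dimensions missing on either side are ignored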