I have a pandas dataframe that looks as follows:
In [23]: dataframe.head()
Out[23]:
column_id 1 10 11 12 13 14 15 16 17 18 ... 46 47 48 49 5 50 \
row_id ...
1 NaN NaN 1 1 1 1 1 1 1 1 ... 1 1 NaN 1 NaN NaN
10 1 1 1 1 1 1 1 1 1 NaN ... 1 1 1 NaN 1 NaN
100 1 1 NaN 1 1 1 1 1 NaN 1 ... NaN NaN 1 1 1 NaN
11 NaN 1 1 1 1 1 1 1 1 NaN ... NaN 1 1 1 1 1
12 1 1 1 NaN 1 1 1 1 NaN 1 ... 1 NaN 1 1 NaN 1
The thing is, I'm currently using the Pearson correlation to calculate similarity between rows, and given the nature of the data, the standard deviation is sometimes zero (all values are 1 or NaN), so the Pearson correlation returns this:
In [24]: dataframe.transpose().corr().head()
Out[24]:
row_id 1 10 100 11 12 13 14 15 16 17 ... 90 91 92 93 94 95 \
row_id ...
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
100 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
11 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
Is there any other way of computing correlations that avoids this? Or maybe an easy way to calculate the Euclidean distance between rows with a single method, just as the Pearson correlation has?
Thanks!
A.
One option is scipy.spatial.distance_matrix:

distance_matrix(x, y, p=2)

Parameters:
x : (M, K) matrix of M vectors, each of dimension K.
y : (N, K) matrix of N vectors, each of dimension K.
p : float, 1 <= p <= infinity, defines which Minkowski p-norm to use.

Returns: (M, N) ndarray / matrix containing the distance from every vector in x to every vector in y.

The Euclidean distance is simply the square root of the sum of squared differences between corresponding elements of the rows (or columns). This is probably the most commonly used distance metric.
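For example, here is a minimal sketch of using it on the question's frame, assuming it is called dataframe and that treating NaN as 0 is acceptable (distance_matrix does not accept NaN):

>>> import pandas as pd
>>> from scipy.spatial import distance_matrix
>>> zero_data = dataframe.fillna(0)
>>> result = pd.DataFrame(distance_matrix(zero_data.values, zero_data.values),
...                       index=zero_data.index, columns=zero_data.index)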
The key question here is what distance metric to use.
Let's say this is your data.
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.DataFrame(np.random.rand(100, 50))
>>> data[data > 0.2] = 1
>>> data[data <= 0.2] = np.nan
>>> data.head()
0 1 2 3 4 5 6 7 8 9 ... 40 41 42 43 44 45 46 47 \
0 1 1 1 NaN 1 NaN NaN 1 1 1 ... 1 1 NaN 1 NaN 1 1 1
1 1 1 1 NaN 1 1 1 1 1 1 ... NaN 1 1 NaN NaN 1 1 1
2 1 1 1 1 1 1 1 1 1 1 ... 1 NaN 1 1 1 1 1 NaN
3 1 NaN 1 NaN 1 NaN 1 NaN 1 1 ... 1 1 1 1 NaN 1 1 1
4 1 1 1 1 1 1 1 1 NaN 1 ... NaN 1 1 1 1 1 1 1
You can compute a distance metric as the fraction of values that differ between each pair of columns. The result shows the proportion of positions at which any two columns differ.
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: (column1 - column2).abs().sum() / len(column1)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 7 8 9 ... 40 \
0 0.00 0.36 0.33 0.37 0.32 0.41 0.35 0.33 0.39 0.33 ... 0.37
1 0.36 0.00 0.37 0.29 0.30 0.37 0.33 0.37 0.33 0.31 ... 0.35
2 0.33 0.37 0.00 0.36 0.29 0.38 0.40 0.34 0.30 0.28 ... 0.28
3 0.37 0.29 0.36 0.00 0.29 0.30 0.34 0.26 0.32 0.36 ... 0.36
4 0.32 0.30 0.29 0.29 0.00 0.31 0.35 0.29 0.29 0.25 ... 0.27
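As an aside, if scipy is available, the same proportion-of-differing-values can be computed in a vectorized way with its 'hamming' metric. This is only a sketch and assumes the filled data is strictly 0/1:

>>> from scipy.spatial.distance import pdist, squareform
>>> result = pd.DataFrame(squareform(pdist(zero_data.T.values, metric='hamming')),
...                       index=zero_data.columns, columns=zero_data.columns)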
Alternatively, you can use the Pearson correlation coefficient. This is a perfectly valid metric; specifically, it translates to the phi coefficient in the case of binary data.
>>> import scipy.stats
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: scipy.stats.pearsonr(column1, column2)[0]
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 1.000000 0.013158 0.026262 -0.059786 -0.024293 -0.078056 0.054074
1 0.013158 1.000000 -0.093109 0.170159 0.043187 0.027425 0.108148
2 0.026262 -0.093109 1.000000 -0.124540 -0.048485 -0.064881 -0.161887
3 -0.059786 0.170159 -0.124540 1.000000 0.004245 0.184153 0.042524
4 -0.024293 0.043187 -0.048485 0.004245 1.000000 0.079196 -0.099834
Incidentally, this is the same result that you would get with the Spearman R coefficient as well.
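As a side note, pandas can compute both of these pairwise column correlations directly, without the nested apply; a quick sketch using the same zero_data as above:

>>> result_pearson = zero_data.corr()                     # Pearson is the default method
>>> result_spearman = zero_data.corr(method='spearman')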
Finally, here is the Euclidean distance, which is what the question asked about:
>>> zero_data = data.fillna(0)
>>> distance = lambda column1, column2: np.linalg.norm(column1 - column2)
>>> result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
>>> result.head()
0 1 2 3 4 5 6 \
0 0.000000 6.000000 5.744563 6.082763 5.656854 6.403124 5.916080
1 6.000000 0.000000 6.082763 5.385165 5.477226 6.082763 5.744563
2 5.744563 6.082763 0.000000 6.000000 5.385165 6.164414 6.324555
3 6.082763 5.385165 6.000000 0.000000 5.385165 5.477226 5.830952
4 5.656854 5.477226 5.385165 5.385165 0.000000 5.567764 5.916080
By now, you'd have a sense of the pattern. Create a distance method. Then apply it pairwise to every column using

data.apply(lambda col1: data.apply(lambda col2: method(col1, col2)))

If your distance method relies on the presence of zeroes instead of NaNs, convert to zeroes using .fillna(0).
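If you do this with several metrics, one way to package that pattern is a small helper function. This is just a sketch, and pairwise is an illustrative name rather than anything built into pandas:

def pairwise(data, metric):
    # Apply `metric` to every pair of columns and return the square result.
    return data.apply(lambda col1: data.apply(lambda col2: metric(col1, col2)))

result = pairwise(data.fillna(0), lambda c1, c2: np.linalg.norm(c1 - c2))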
A proposal to improve the excellent answer from @s-anand for Euclidean distance: instead of

zero_data = data.fillna(0)
distance = lambda column1, column2: np.linalg.norm(column1 - column2)

we can apply fillna to fill only the missing data, thus:

distance = lambda column1, column2: np.linalg.norm((column1 - column2).fillna(0))

This way, the distance on missing dimensions will not be counted.
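To see the difference on a tiny example (a sketch; the values are chosen only for illustration):

import numpy as np
import pandas as pd

a = pd.Series([1, np.nan, 1])
b = pd.Series([1, 1, np.nan])

np.linalg.norm(a.fillna(0) - b.fillna(0))   # sqrt(2): each missing value counts as a full difference
np.linalg.norm((a - b).fillna(0))           # 0.0: dimensions missing on either side are ignored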