Distance calculation between rows in Pandas Dataframe using a distance matrix

I have the following Pandas DataFrame:

In [31]:
import pandas as pd
sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'],
     'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'],
     'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
In [32]: print(sample)
Out[32]:
      Sym1 Sym2 Sym3 Sym4
Item1    a    a    a    b
Item2    a    c    c    b
Item3    a    b    b    b
Item4    d    b    d    a

I want an elegant way to compute the distance between every pair of items according to this distance matrix:

In [34]:
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34],
     'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0],
     'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])
print(DistMatrix)
Out[34]:
      a     b     c     d
a  0.00  0.00  0.67  1.34
b  0.00  0.00  0.00  0.67
c  0.67  0.00  0.00  0.00
d  1.34  0.67  0.00  0.00

For example, comparing Item1 to Item2 means comparing aaab with accb position by position; using the distance matrix, this gives 0 + 0.67 + 0.67 + 0 = 1.34.

Ideal output:

       Item1  Item2  Item3  Item4
Item1   0.00   1.34   0.00   2.68
Item2   1.34   0.00   0.00   1.34
Item3   0.00   0.00   0.00   2.01
Item4   2.68   1.34   2.01   0.00

659

asked Nov 30 '13 17:11

Clayton

3 Answers

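This is an old question, but SciPy has a function for computing pairwise distances between rows:

from scipy.spatial.distance import pdist, squareform

# Note: built-in metrics such as 'euclidean' need numeric data; the
# string-valued `sample` from the question would raise an error here
# and needs a custom metric (or a numeric encoding) instead.
distances = pdist(sample.values, metric='euclidean')
dist_matrix = squareform(distances)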

pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument selects one of several built-in distance metrics, or you can pass any function of two 1-D arrays to define a custom distance. It's very powerful and, in my experience, very fast. The result is a "flat" (condensed) array that contains only the upper triangle of the distance matrix (because it's symmetric), excluding the diagonal (because it's always 0). squareform then expands this condensed form into a full square matrix.
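
For the symbol data in the question, one way to use a custom metric is to encode each symbol as an integer and look the distances up in a table. The following is only a minimal sketch and not part of the original answer: the names codes, lookup and table_distance are made up for illustration, and it assumes that every symbol in sample appears in the index of DistMatrix and that DistMatrix is symmetric.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'],
     'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'],
     'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34],
     'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0],
     'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Encode every symbol as its integer position in DistMatrix's index
# (assumption: no symbol is missing, otherwise get_indexer returns -1).
symbols = DistMatrix.index
codes = symbols.get_indexer(sample.to_numpy().ravel()).reshape(sample.shape)

# Plain NumPy lookup table with rows and columns in the same symbol order.
lookup = DistMatrix.loc[symbols, symbols].to_numpy()

def table_distance(u, v):
    """Hypothetical custom metric: sum the per-position table distances."""
    # Cast defensively in case the rows arrive as floats.
    return lookup[u.astype(int), v.astype(int)].sum()

condensed = pdist(codes, metric=table_distance)
result = pd.DataFrame(squareform(condensed),
                      index=sample.index, columns=sample.index)
print(result)
```

Because pdist evaluates each pair only once, the result is symmetric by construction, so this approach only makes sense for a symmetric table.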

The docs have more info, including a mathematical rundown of the many built-in distance functions.

192

answered Oct 05 '22 19:10

shadowtalker

This does twice as much work as needed, but it technically also works for non-symmetric distance matrices (whatever that is supposed to mean):

pd.DataFrame({idx1: {idx2: sum(DistMatrix[x][y]
                               for x, y in zip(row1, row2))
                     for idx2, row2 in sample.iterrows()}
              for idx1, row1 in sample.iterrows()})

You can make it more readable by writing it in pieces:

# Helper: distance between two items, summed position by position
def dist(xs, ys):
    return sum(DistMatrix[x][y] for x, y in zip(xs, ys))

# Helper: distances from a given item to every item in `sample`
def xdist(x):
    return {idx: dist(x, y) for idx, y in sample.iterrows()}

# The pairwise distance matrix
pd.DataFrame({idx: xdist(x) for idx, x in sample.iterrows()})
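
If the distance matrix is symmetric, the duplicated work can be avoided by visiting each unordered pair once and mirroring the value. This is a rough sketch under that assumption, not a drop-in replacement: the names items and result are chosen here for illustration, and the loop is still plain Python, so it is not fast for large inputs.

```python
from itertools import combinations

import pandas as pd

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'],
     'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'],
     'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34],
     'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0],
     'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

def dist(xs, ys):
    """Distance between two items, summed position by position."""
    return sum(DistMatrix[x][y] for x, y in zip(xs, ys))

items = list(sample.index)
# Start from zeros; the diagonal stays 0 because every symbol in this
# table has distance 0 to itself.
result = pd.DataFrame(0.0, index=items, columns=items)

for i, j in combinations(items, 2):   # each unordered pair exactly once
    d = dist(sample.loc[i], sample.loc[j])
    result.loc[i, j] = d
    result.loc[j, i] = d              # mirror: valid only for a symmetric table

print(result)
```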

answered Oct 05 '22 20:10

behzad.nouri

For large data, I found a faster way to do this. Assume your data is already a NumPy array named a. Note that this computes plain Euclidean distances on numeric data.

from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(a, a)

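Below is an experiment comparing the time needed by the two approaches:

import time

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

a = np.random.rand(1000, 1000)

# SciPy: condensed distances, then expand to a square matrix
time1 = time.time()
distances = pdist(a, metric='euclidean')
dist_matrix = squareform(distances)
time2 = time.time()
time2 - time1  # 0.3639109134674072 (seconds)

# scikit-learn: square matrix directly
time1 = time.time()
dist = euclidean_distances(a, a)
time2 = time.time()
time2 - time1  # 0.08735871315002441 (seconds)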
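
One likely contributor to the difference: the scikit-learn documentation describes euclidean_distances as computing sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)), which turns most of the work into a single matrix product, and it notes that this is not the most precise formulation. The NumPy-only sketch below illustrates that expansion on a smaller array. It is an illustration of the idea and not scikit-learn's actual implementation; the variable names are made up here.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((200, 50))

# Direct definition: broadcast all pairwise differences
# (builds an n x n x d temporary array).
direct = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=2))

# Expanded form: ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2
sq_norms = (a ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] - 2.0 * (a @ a.T) + sq_norms[None, :]
np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives caused by rounding
expanded = np.sqrt(sq_dists)
np.fill_diagonal(expanded, 0.0)          # self-distances are exactly 0

print(np.allclose(direct, expanded, atol=1e-6))
```

The expanded form trades a little numerical precision for speed, so results may differ from pdist in the last few digits.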

answered Oct 05 '22 19:10

Michelle Owen

Tags:

python

pandas

matrix

euclidean-distance

time-series