I have the following Pandas DataFrame:
In [31]:
import pandas as pd
sample = pd.DataFrame({'Sym1': ['a','a','a','d'],'Sym2':['a','c','b','b'],'Sym3':['a','c','b','d'],'Sym4':['b','b','b','a']},index=['Item1','Item2','Item3','Item4'])
In [32]: print(sample)
Out [32]:
      Sym1 Sym2 Sym3 Sym4
Item1    a    a    a    b
Item2    a    c    c    b
Item3    a    b    b    b
Item4    d    b    d    a
I want to find an elegant way to compute the distance between every pair of Items, according to this distance matrix:
In [34]:
DistMatrix = pd.DataFrame({'a': [0,0,0.67,1.34],'b':[0,0,0,0.67],'c':[0.67,0,0,0],'d':[1.34,0.67,0,0]},index=['a','b','c','d'])
print(DistMatrix)
Out[34]:
      a     b     c     d
a  0.00  0.00  0.67  1.34
b  0.00  0.00  0.00  0.67
c  0.67  0.00  0.00  0.00
d  1.34  0.67  0.00  0.00
For example, comparing Item1 to Item2 would compare aaab -> accb, position by position. Using the distance matrix, this would be 0 + 0.67 + 0.67 + 0 = 1.34.
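To make that arithmetic concrete, here is a minimal sketch that reproduces the Item1 vs. Item2 number. The names `pairs`, `terms`, and `total` are only for illustration; everything else comes from the code above.

```python
import pandas as pd

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Pair up the symbols of Item1 and Item2 column by column.
pairs = list(zip(sample.loc['Item1'], sample.loc['Item2']))

# Look up each pair in the distance matrix and add the terms.
terms = [DistMatrix.loc[x, y] for x, y in pairs]
total = sum(terms)

print(pairs)
print(terms)
print(round(total, 2))
```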
Ideal output:
       Item1  Item2  Item3  Item4
Item1      0   1.34      0   2.68
Item2   1.34      0      0   1.34
Item3      0      0      0   2.01
Item4   2.68   1.34   2.01      0
This is an old question, but there is a SciPy function that does this:
from scipy.spatial.distance import pdist, squareform
distances = pdist(sample.values, metric='euclidean')
dist_matrix = squareform(distances)
pdist operates on NumPy arrays, and DataFrame.values is the underlying NumPy ndarray representation of the data frame. The metric argument lets you select one of several built-in distance metrics, or you can pass in any function of two 1-D arrays to use a custom distance. It's very powerful and, in my experience, very fast. Note that built-in metrics such as 'euclidean' expect numeric data, so for the symbolic columns in the question you would need a custom function rather than the call shown above. The result is a "flat" array that consists only of the upper triangle of the distance matrix (because it's symmetric), not including the diagonal (because it's always 0). squareform then translates this flattened form into a full matrix.
The docs have more info, including a mathematical rundown of the many built-in distance functions.
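As a hedged illustration of the custom-metric route for the symbolic data in the question: the sketch below encodes each symbol as its row position in DistMatrix and hands pdist a lookup function. The names `position`, `codes`, `D`, and `symbol_distance` are made up for this example. It assumes DistMatrix is symmetric (it is here), since pdist evaluates each pair only once.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Map each symbol to its integer position in DistMatrix (hypothetical helper names).
position = {sym: i for i, sym in enumerate(DistMatrix.index)}
codes = np.array([[position[s] for s in row] for row in sample.values])

# Plain NumPy copy of the lookup table, ordered the same way on both axes.
D = DistMatrix.loc[DistMatrix.index, DistMatrix.index].to_numpy()

def symbol_distance(u, v):
    # pdist may hand over floats, so cast back to integer positions first.
    return D[u.astype(int), v.astype(int)].sum()

result = pd.DataFrame(
    squareform(pdist(codes, metric=symbol_distance)),
    index=sample.index, columns=sample.index)
print(result)
```

A Python callable is invoked once per pair, so this is likely to be much slower than the built-in metrics on large inputs.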
This does twice as much work as needed, because it computes both the (i, j) and the (j, i) entry, but technically it also works for non-symmetric distance matrices (whatever that is supposed to mean):
pd.DataFrame({idx1: {idx2: sum(DistMatrix[x][y] for (x, y) in zip(row1, row2))
                     for (idx2, row2) in sample.iterrows()}
              for (idx1, row1) in sample.iterrows()})
You can make it more readable by writing it in pieces:
# a helper function to compute distance of two items
dist = lambda xs, ys: sum( DistMatrix[ x ][ y ] for ( x, y ) in zip( xs, ys ) )
# a second helper function to compute distances from a given item
xdist = lambda x: { idx: dist( x, y ) for (idx, y) in sample.iterrows( ) }
# the pairwise distance matrix
pd.DataFrame( { idx: xdist( x ) for ( idx, x ) in sample.iterrows( ) } )
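If the distance matrix is symmetric with a zero diagonal, as it is in the question, the duplicated work can be avoided by visiting each unordered pair once and mirroring the value. This is only a sketch under that assumption, and `result` and the loop variables are illustrative names, not part of the answer above.

```python
from itertools import combinations

import pandas as pd

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

def dist(xs, ys):
    # Sum of per-position lookups, using the same indexing as the answer above.
    return sum(DistMatrix[x][y] for x, y in zip(xs, ys))

# Start from an all-zero square frame; the diagonal is left at 0
# (assumption: every symbol has distance 0 to itself).
result = pd.DataFrame(0.0, index=sample.index, columns=sample.index)

# Visit each unordered pair of items once and fill both mirrored cells.
for i, j in combinations(sample.index, 2):
    d = dist(sample.loc[i], sample.loc[j])
    result.loc[i, j] = d
    result.loc[j, i] = d

print(result)
```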
For large data, I found a faster way to do this. Assume your data is already a numeric NumPy array named a.
from sklearn.metrics.pairwise import euclidean_distances
dist = euclidean_distances(a, a)
Below is an experiment comparing the time needed by the two approaches:
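import time
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics.pairwise import euclidean_distances

a = np.random.rand(1000, 1000)

# SciPy: pdist + squareform
time1 = time.time()
distances = pdist(a, metric='euclidean')
dist_matrix = squareform(distances)
time2 = time.time()
time2 - time1  # 0.3639109134674072

# scikit-learn: euclidean_distances
time1 = time.time()
dist = euclidean_distances(a, a)
time2 = time.time()
time2 - time1  # 0.08735871315002441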
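Note that euclidean_distances needs numeric input, so it does not apply directly to the symbolic columns in the question. A vectorized analogue for that case, sketched here as a separate technique rather than part of the answer above, is to turn the symbols into integer positions and let NumPy fancy indexing do all lookups at once. The names `codes`, `D`, and `pairwise` are illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {'Sym1': ['a', 'a', 'a', 'd'], 'Sym2': ['a', 'c', 'b', 'b'],
     'Sym3': ['a', 'c', 'b', 'd'], 'Sym4': ['b', 'b', 'b', 'a']},
    index=['Item1', 'Item2', 'Item3', 'Item4'])
DistMatrix = pd.DataFrame(
    {'a': [0, 0, 0.67, 1.34], 'b': [0, 0, 0, 0.67],
     'c': [0.67, 0, 0, 0], 'd': [1.34, 0.67, 0, 0]},
    index=['a', 'b', 'c', 'd'])

# Integer position of every symbol in DistMatrix; -1 would mean "unknown symbol".
codes = DistMatrix.index.get_indexer(sample.to_numpy().ravel()).reshape(sample.shape)
assert (codes >= 0).all(), "sample contains a symbol missing from DistMatrix"

D = DistMatrix.to_numpy()

# codes[:, None, :] has shape (n, 1, k) and codes[None, :, :] has shape (1, n, k).
# Indexing D with both broadcasts to (n, n, k); summing over k gives (n, n).
pairwise = D[codes[:, None, :], codes[None, :, :]].sum(axis=-1)

result = pd.DataFrame(pairwise, index=sample.index, columns=sample.index)
print(result)
```

The intermediate array has n × n × k entries (items × items × columns), so for very large inputs it may need to be processed in chunks.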