
Euclidean Distance Matrix Using Pandas

I have a .csv file that contains city, latitude and longitude data in the following format:

CITY|LATITUDE|LONGITUDE
A|40.745392|-73.978364
B|42.562786|-114.460503
C|37.227928|-77.401924
D|41.245708|-75.881241
E|41.308273|-72.927887

I need to create a distance matrix in the following format (please ignore the dummy values):

         A         B         C         D         E   
A  0.000000  6.000000  5.744563  6.082763  5.656854  
B  6.000000  0.000000  6.082763  5.385165  5.477226  
C  5.744563  6.082763  0.000000  6.000000  5.385165
D  6.082763  5.385165  6.000000  0.000000  5.385165  
E  5.656854  5.477226  5.385165  5.385165  0.000000  

I have loaded the data into a pandas DataFrame and created a cross join as follows:

import pandas as pd

df_A = pd.read_csv('lat_lon.csv', delimiter='|', encoding="utf-8-sig")
df_B = df_A.copy()  # copy, so df_B is not just another name for df_A
df_A['key'] = 1
df_B['key'] = 1
df_C = pd.merge(df_A, df_B, on='key')  # cross join on the constant key

  • Can you please help me create the above matrix structure?
  • Also, is it possible to avoid the step involving the cross join?
Abacus asked Aug 29 '16 10:08


People also ask

How do you find the Euclidean distance of a matrix?

The Euclidean distance between two rows (or columns) is the square root of the sum of the squared differences between their corresponding elements. It is probably the most commonly used distance metric.
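As a quick check of that definition, here is a minimal sketch that computes the distance between cities A and E from the question by hand, using only the standard library. The variable names are illustrative. The coordinates are treated as plain numbers, so the result is in degrees rather than a geographic distance.

```python
import math

# (latitude, longitude) pairs for cities A and E, copied from the question.
a = (40.745392, -73.978364)
e = (41.308273, -72.927887)

# Square root of the sum of squared coordinate differences.
d_ae = math.sqrt((a[0] - e[0]) ** 2 + (a[1] - e[1]) ** 2)
print(round(d_ae, 6))
```

The value should agree with the A/E entry of the distance matrices shown in the answers.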


3 Answers

You can use the pdist and squareform functions from scipy.spatial.distance:

In [12]: df
Out[12]:
  CITY   LATITUDE   LONGITUDE
0    A  40.745392  -73.978364
1    B  42.562786 -114.460503
2    C  37.227928  -77.401924
3    D  41.245708  -75.881241
4    E  41.308273  -72.927887

In [13]: from scipy.spatial.distance import squareform, pdist

In [14]: pd.DataFrame(squareform(pdist(df.iloc[:, 1:])), columns=df.CITY.unique(), index=df.CITY.unique())
Out[14]:
           A          B          C          D          E
A   0.000000  40.522913   4.908494   1.967551   1.191779
B  40.522913   0.000000  37.440606  38.601738  41.551558
C   4.908494  37.440606   0.000000   4.295932   6.055264
D   1.967551  38.601738   4.295932   0.000000   2.954017
E   1.191779  41.551558   6.055264   2.954017   0.000000
MaxU - stop WAR against UA answered Oct 01 '22 13:10


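# Add one column per city holding the distance from every row to that city.
for j in df["CITY"]:
    # Coordinates of the reference city j (first match if names repeat).
    row = df.loc[df["CITY"] == j, ["LATITUDE", "LONGITUDE"]].iloc[0]
    df[j] = ((df["LATITUDE"] - row["LATITUDE"])**2 + (df["LONGITUDE"] - row["LONGITUDE"])**2)**0.5

# Label the rows by city and keep only the distance columns.
df = df.set_index("CITY").drop(["LATITUDE", "LONGITUDE"], axis=1)

This works without SciPy, but the Python-level loop is slower than pdist or cdist on large inputs.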

Himaprasoon answered Oct 01 '22 13:10


The matrix can be created directly with cdist from scipy.spatial.distance:

from scipy.spatial.distance import cdist
df_array = df[["LATITUDE", "LONGITUDE"]].to_numpy()
dist_mat = cdist(df_array, df_array)
pd.DataFrame(dist_mat, columns = df["CITY"], index = df["CITY"])
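If SciPy is not available, the same matrix can be sketched with NumPy broadcasting alone. This is an illustrative alternative, not part of the original answer. The sample data is inlined through io.StringIO so the snippet runs on its own, where a real script would read lat_lon.csv instead, and the names raw, coords, diff, dist and dist_df are assumptions made for this example.

```python
import io

import numpy as np
import pandas as pd

# Sample data copied from the question (a real script would read lat_lon.csv).
raw = """CITY|LATITUDE|LONGITUDE
A|40.745392|-73.978364
B|42.562786|-114.460503
C|37.227928|-77.401924
D|41.245708|-75.881241
E|41.308273|-72.927887
"""
df = pd.read_csv(io.StringIO(raw), delimiter="|")

# Coordinates as an (n, 2) array.
coords = df[["LATITUDE", "LONGITUDE"]].to_numpy()

# Broadcasting (n, 1, 2) against (1, n, 2) gives all pairwise differences, shape (n, n, 2).
diff = coords[:, None, :] - coords[None, :, :]

# Euclidean distance: square root of the summed squared differences per pair.
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Label rows and columns with the city names.
dist_df = pd.DataFrame(dist, index=df["CITY"], columns=df["CITY"])
print(dist_df.round(6))
```

No cross join is needed here either. The intermediate array holds n × n × 2 values, so for a very large number of cities cdist or pdist is likely the more memory-friendly choice.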
simplyPTA answered Oct 01 '22 11:10