i have a dataframe:
routeId latitude_value longitude_value
r1 28.210216 22.813209
r2 28.216103 22.496735
r3 28.161786 22.842318
r4 28.093110 22.807081
r5 28.220370 22.503500
r6 28.220370 22.503500
r7 28.220370 22.503500
from this i want to generate a dataframe df2 something like this:
routeId nearest
r1 r3 (for example)
r2 ... similarly for all the routes.
The logic i am trying to implement is
for every route, i should find the euclidean distance of all other routes. and iterating it on routeId.
There is a function for calculating euclidean distance.
dist = math.hypot(x2 - x1, y2 - y1)
But i am confused on how to build a function where i would pass a dataframe, or use .apply()
def get_nearest_route():
.....
return df2
We can use scipy.spatial.distance.cdist
or multiple for loops then replace min with routes and find the closest i.e
mat = scipy.spatial.distance.cdist(df[['latitude_value','longitude_value']],
df[['latitude_value','longitude_value']], metric='euclidean')
# If you dont want scipy, you can use plain python like
# import math
# mat = []
# for i,j in zip(df['latitude_value'],df['longitude_value']):
# k = []
# for l,m in zip(df['latitude_value'],df['longitude_value']):
# k.append(math.hypot(i - l, j - m))
# mat.append(k)
# mat = np.array(mat)
new_df = pd.DataFrame(mat, index=df['routeId'], columns=df['routeId'])
Output of new_df
routeId r1 r2 r3 r4 r5 r6 r7
routeId
r1 0.000000 0.316529 0.056505 0.117266 0.309875 0.309875 0.309875
r2 0.316529 0.000000 0.349826 0.333829 0.007998 0.007998 0.007998
r3 0.056505 0.349826 0.000000 0.077188 0.343845 0.343845 0.343845
r4 0.117266 0.333829 0.077188 0.000000 0.329176 0.329176 0.329176
r5 0.309875 0.007998 0.343845 0.329176 0.000000 0.000000 0.000000
r6 0.309875 0.007998 0.343845 0.329176 0.000000 0.000000 0.000000
r7 0.309875 0.007998 0.343845 0.329176 0.000000 0.000000 0.000000
#Replace minimum distance with column name and not the minimum with `False`.
# new_df[new_df != 0].min(),0). This gives a mask matching minimum other than zero.
closest = np.where(new_df.eq(new_df[new_df != 0].min(),0),new_df.columns,False)
# Remove false from the array and get the column names as list .
df['close'] = [i[i.astype(bool)].tolist() for i in closest]
routeId latitude_value longitude_value close
0 r1 28.210216 22.813209 [r3]
1 r2 28.216103 22.496735 [r5, r6, r7]
2 r3 28.161786 22.842318 [r1]
3 r4 28.093110 22.807081 [r3]
4 r5 28.220370 22.503500 [r2]
5 r6 28.220370 22.503500 [r2]
6 r7 28.220370 22.503500 [r2]
If you dont want to ignore zero then
# Store the array values in a variable
arr = new_df.values
# We dont want to find mimimum to be same point, so replace diagonal by nan
arr[np.diag_indices_from(new_df)] = np.nan
# Replace the non nan min with column name and otherwise with false
new_close = np.where(arr == np.nanmin(arr, axis=1)[:,None],new_df.columns,False)
# Get column names ignoring false.
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]
routeId latitude_value longitude_value close
0 r1 28.210216 22.813209 [r3]
1 r2 28.216103 22.496735 [r5, r6, r7]
2 r3 28.161786 22.842318 [r1]
3 r4 28.093110 22.807081 [r3]
4 r5 28.220370 22.503500 [r6, r7]
5 r6 28.220370 22.503500 [r5, r7]
6 r7 28.220370 22.503500 [r5, r6]
I advise to use pdist function from scipy.spatial.distance.
matrix = scipy.spatial.distance.pdist(df[['latitude_value', 'longitude_value']], metric='euclidean')
Will return you condensed distance matrix of shape (n,) with all pairwise distances calculated.
Then you can use squareform to get square-shaped pairwise distance matrix:
matrix = scipy.spatial.distance.squareform(matrix)
Then for each row matrix[i]
you can find biggest value at index e.g. matrix[i][j]
and you know that for i-th point, its closest point is j-th point.
Can not compare with scipy.spatial.distance.pdist
, but still work '
from itertools import product
import pandas as pd
import math
df['New']=list(zip(df['latitude_value'],df['longitude_value']))
DF=pd.DataFrame(list(product(df.routeId, df.routeId)), columns=['l1', 'l2'])
New=df[['routeId','New']].merge(DF,left_on='routeId',right_on='l1',how='left').merge(df[['routeId','New']],left_on='l2',right_on='routeId')
New['Cal']=New.apply(lambda x : math.hypot(x.New_x[0] - x.New_y[0] , x.New_x[1] - x.New_y[1] ),axis=1)
New=New.loc[New.l1!=New.l2,:]
New.sort_values('Cal').drop_duplicates(['l1'],keep='first')
Out[386]:
routeId_x New_x l1 l2 routeId_y New_y Cal
47 r6 (28.22037, 22.5035) r6 r7 r7 (28.22037, 22.5035) 0.000000
41 r7 (28.22037, 22.5035) r7 r6 r6 (28.22037, 22.5035) 0.000000
39 r5 (28.22037, 22.5035) r5 r6 r6 (28.22037, 22.5035) 0.000000
43 r2 (28.216103, 22.496735) r2 r7 r7 (28.22037, 22.5035) 0.007998
2 r3 (28.161786, 22.842318) r3 r1 r1 (28.210216, 22.813209) 0.056505
14 r1 (28.210216, 22.813209) r1 r3 r3 (28.161786, 22.842318) 0.056505
17 r4 (28.09311, 22.807081) r4 r3 r3 (28.161786, 22.842318) 0.077188
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With