Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get Nearest Point from each other in pandas dataframe

i have a dataframe:

  routeId  latitude_value  longitude_value
  r1       28.210216        22.813209
  r2       28.216103        22.496735
  r3       28.161786        22.842318
  r4       28.093110        22.807081
  r5       28.220370        22.503500
  r6       28.220370        22.503500
  r7       28.220370        22.503500

from this i want to generate a dataframe df2 something like this:

routeId    nearest
  r1         r3         (for example)
  r2       ...    similarly for all the routes.

The logic i am trying to implement is

for every route, i should find the euclidean distance of all other routes. and iterating it on routeId.

There is a function for calculating euclidean distance.

dist = math.hypot(x2 - x1, y2 - y1)

But i am confused on how to build a function where i would pass a dataframe, or use .apply()

def  get_nearest_route():
    .....
    return df2
like image 471
Shubham R Avatar asked Nov 28 '17 15:11

Shubham R


3 Answers

We can use scipy.spatial.distance.cdist or multiple for loops then replace min with routes and find the closest i.e

mat = scipy.spatial.distance.cdist(df[['latitude_value','longitude_value']], 
                              df[['latitude_value','longitude_value']], metric='euclidean')

# If you dont want scipy, you can use plain python like 
# import math
# mat = []
# for i,j in zip(df['latitude_value'],df['longitude_value']):
#     k = []
#     for l,m in zip(df['latitude_value'],df['longitude_value']):
#         k.append(math.hypot(i - l, j - m))
#     mat.append(k)
# mat = np.array(mat)

new_df = pd.DataFrame(mat, index=df['routeId'], columns=df['routeId']) 

Output of new_df

routeId        r1        r2        r3        r4        r5        r6        r7
routeId                                                                      
r1       0.000000  0.316529  0.056505  0.117266  0.309875  0.309875  0.309875
r2       0.316529  0.000000  0.349826  0.333829  0.007998  0.007998  0.007998
r3       0.056505  0.349826  0.000000  0.077188  0.343845  0.343845  0.343845
r4       0.117266  0.333829  0.077188  0.000000  0.329176  0.329176  0.329176
r5       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r6       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000
r7       0.309875  0.007998  0.343845  0.329176  0.000000  0.000000  0.000000    

#Replace minimum distance with column name and not the minimum with `False`.
# new_df[new_df != 0].min(),0). This gives a mask matching minimum other than zero.  
closest = np.where(new_df.eq(new_df[new_df != 0].min(),0),new_df.columns,False)

# Remove false from the array and get the column names as list . 
df['close'] = [i[i.astype(bool)].tolist() for i in closest]


 routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500          [r2]
5      r6       28.220370        22.503500          [r2]
6      r7       28.220370        22.503500          [r2] 

If you dont want to ignore zero then

# Store the array values in a variable
arr = new_df.values
# We dont want to find mimimum to be same point, so replace diagonal by nan
arr[np.diag_indices_from(new_df)] = np.nan

# Replace the non nan min with column name and otherwise with false
new_close = np.where(arr == np.nanmin(arr, axis=1)[:,None],new_df.columns,False)

# Get column names ignoring false. 
df['close'] = [i[i.astype(bool)].tolist() for i in new_close]

   routeId  latitude_value  longitude_value         close
0      r1       28.210216        22.813209          [r3]
1      r2       28.216103        22.496735  [r5, r6, r7]
2      r3       28.161786        22.842318          [r1]
3      r4       28.093110        22.807081          [r3]
4      r5       28.220370        22.503500      [r6, r7]
5      r6       28.220370        22.503500      [r5, r7]
6      r7       28.220370        22.503500      [r5, r6]
like image 178
Bharath Avatar answered Oct 28 '22 04:10

Bharath


I advise to use pdist function from scipy.spatial.distance.

matrix = scipy.spatial.distance.pdist(df[['latitude_value', 'longitude_value']], metric='euclidean')

Will return you condensed distance matrix of shape (n,) with all pairwise distances calculated.

Then you can use squareform to get square-shaped pairwise distance matrix:

matrix = scipy.spatial.distance.squareform(matrix)

Then for each row matrix[i] you can find biggest value at index e.g. matrix[i][j] and you know that for i-th point, its closest point is j-th point.

like image 21
jo9k Avatar answered Oct 28 '22 03:10

jo9k


Can not compare with scipy.spatial.distance.pdist, but still work '

from itertools import product
import pandas as pd
import math
df['New']=list(zip(df['latitude_value'],df['longitude_value']))
DF=pd.DataFrame(list(product(df.routeId, df.routeId)), columns=['l1', 'l2'])
New=df[['routeId','New']].merge(DF,left_on='routeId',right_on='l1',how='left').merge(df[['routeId','New']],left_on='l2',right_on='routeId')
New['Cal']=New.apply(lambda x : math.hypot(x.New_x[0] - x.New_y[0] , x.New_x[1]  - x.New_y[1] ),axis=1)
New=New.loc[New.l1!=New.l2,:]
New.sort_values('Cal').drop_duplicates(['l1'],keep='first')


Out[386]: 
   routeId_x                   New_x  l1  l2 routeId_y                   New_y       Cal
47        r6     (28.22037, 22.5035)  r6  r7        r7     (28.22037, 22.5035)  0.000000
41        r7     (28.22037, 22.5035)  r7  r6        r6     (28.22037, 22.5035)  0.000000
39        r5     (28.22037, 22.5035)  r5  r6        r6     (28.22037, 22.5035)  0.000000
43        r2  (28.216103, 22.496735)  r2  r7        r7     (28.22037, 22.5035)  0.007998
2         r3  (28.161786, 22.842318)  r3  r1        r1  (28.210216, 22.813209)  0.056505
14        r1  (28.210216, 22.813209)  r1  r3        r3  (28.161786, 22.842318)  0.056505
17        r4   (28.09311, 22.807081)  r4  r3        r3  (28.161786, 22.842318)  0.077188
like image 2
BENY Avatar answered Oct 28 '22 03:10

BENY