I have two sets of geocodes stored as pandas Series, and I am trying to find the fastest way to get the minimum Euclidean distance from each point in set A to the points in set B. That is, for 40.748043, -73.992953 I want the closest point in the second set, and so on. I would really appreciate any suggestions or help.
Set A:
print(latitude1)
print(longitude1)
0 40.748043
1 42.361016
Name: latitude, dtype: float64
0 -73.992953
1 -71.020005
Name: longitude, dtype: float64
Set B:
print(latitude2)
print(longitude2)
0 42.50729
1 42.50779
2 25.56473
3 25.78953
4 25.33132
5 25.06570
6 25.59246
7 25.61955
8 25.33737
9 24.11028
Name: latitude, dtype: float64
0 1.53414
1 1.52109
2 55.55517
3 55.94320
4 56.34199
5 55.17128
6 56.26176
7 56.27291
8 55.41206
9 52.73056
Name: longitude, dtype: float64
The numpy.vectorize() function wraps a Python function so that it can be applied element-wise to sequences of objects such as NumPy arrays. It accepts nested sequences of objects or NumPy arrays as inputs and returns a single NumPy array or a tuple of NumPy arrays.
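As a small illustration of the description above, here is a minimal sketch of numpy.vectorize applied to a scalar function. The helper name abs_diff and the sample values are assumptions made up for this example. Note that the NumPy documentation describes vectorize as a convenience wrapper that is essentially a loop, so it should not be expected to speed anything up.

```python
import numpy as np

def abs_diff(value, target):
    """Absolute difference between two scalars (hypothetical helper)."""
    return abs(value - target)

# Wrap the scalar function so it can be called on whole arrays.
# This is a convenience, not a performance optimisation.
vec_abs_diff = np.vectorize(abs_diff)

# A few latitudes taken from set B, compared against one latitude from set A.
latitudes = np.array([42.50729, 25.56473, 24.11028])
diffs = vec_abs_diff(latitudes, 40.748043)

# Position of the latitude closest to the target.
nearest_idx = int(np.argmin(diffs))
```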
We can find the nearest value in a list by using the min() function. Define a function that returns the absolute difference between a list value and the given value, then pass that function to min() as its key argument; min() then returns the list value closest to the given value.
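The min() approach just described might look like the following sketch for a single coordinate. The names values, target and distance_to_target are illustrative assumptions, and this only handles one dimension, not latitude/longitude pairs.

```python
# Candidate values (a few latitudes from set B) and the value to match.
values = [42.50729, 25.56473, 25.06570, 24.11028]
target = 25.361016

def distance_to_target(value):
    """Absolute difference between a candidate and the target."""
    return abs(value - target)

# min() compares candidates by their key, so it returns the nearest value.
closest = min(values, key=distance_to_target)
```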
The numpy.asarray() function converts its input to an array. The input can be lists, lists of tuples, tuples, tuples of tuples, tuples of lists, or arrays. Syntax: numpy.asarray(a, dtype=None, order=None)
This is one way, using just numpy.linalg.norm.
import pandas as pd, numpy as np
df1['coords1'] = list(zip(df1['latitude1'], df1['longitude1']))
df2['coords2'] = list(zip(df2['latitude2'], df2['longitude2']))
def calc_min(x):
    # index of the point in df2 with the smallest Euclidean distance to x
    amin = np.argmin([np.linalg.norm(np.array(x)-np.array(y)) for y in df2['coords2']])
    return df2['coords2'].iloc[amin]
df1['closest'] = df1['coords1'].map(calc_min)
#    latitude1  longitude1                  coords1              closest
# 0  40.748043  -73.992953  (40.748043, -73.992953)  (42.50779, 1.52109)
# 1  42.361016  -71.020005  (42.361016, -71.020005)  (42.50779, 1.52109)
# 2  25.361016   54.000000        (25.361016, 54.0)  (25.0657, 55.17128)
Setup
from io import StringIO
mystr1 = """latitude1|longitude1
40.748043|-73.992953
42.361016|-71.020005
25.361016|54.0000
"""
mystr2 = """latitude2|longitude2
42.50729|1.53414
42.50779|1.52109
25.56473|55.55517
25.78953|55.94320
25.33132|56.34199
25.06570|55.17128
25.59246|56.26176
25.61955|56.27291
25.33737|55.41206
24.11028|52.73056"""
df1 = pd.read_csv(StringIO(mystr1), sep='|')
df2 = pd.read_csv(StringIO(mystr2), sep='|')
If performance is an issue, you can vectorize this calculation fairly easily via the underlying numpy arrays.
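As a rough sketch of what that vectorized version could look like, the snippet below uses NumPy broadcasting to compute all pairwise Euclidean distances at once. It rebuilds the same data as the setup so that it runs on its own. The variable names (a, b, dist, idx) are assumptions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same data as in the setup, rebuilt here so the sketch is self-contained.
df1 = pd.DataFrame({
    "latitude1": [40.748043, 42.361016, 25.361016],
    "longitude1": [-73.992953, -71.020005, 54.0],
})
df2 = pd.DataFrame({
    "latitude2": [42.50729, 42.50779, 25.56473, 25.78953, 25.33132,
                  25.06570, 25.59246, 25.61955, 25.33737, 24.11028],
    "longitude2": [1.53414, 1.52109, 55.55517, 55.94320, 56.34199,
                   55.17128, 56.26176, 56.27291, 55.41206, 52.73056],
})

a = df1[["latitude1", "longitude1"]].to_numpy()  # shape (n, 2)
b = df2[["latitude2", "longitude2"]].to_numpy()  # shape (m, 2)

# Broadcasting: (n, 1, 2) - (1, m, 2) -> (n, m, 2) coordinate differences.
diff = a[:, None, :] - b[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))          # shape (n, m)

idx = dist.argmin(axis=1)                        # nearest row of df2 per row of df1
min_dist = dist[np.arange(len(a)), idx]          # the corresponding distances

# Store the nearest point as a (latitude, longitude) tuple, as in the loop version.
df1["closest"] = list(map(tuple, b[idx]))
```

The intermediate array has n × m × 2 elements, so for very large sets it may be worth processing set A in chunks to keep memory use in check.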