I have a dataframe that contains a list of lat/lon coordinates:
d = {'Provider ID': {0: '10001',
1: '10005',
2: '10006',
3: '10007',
4: '10008',
5: '10011',
6: '10012',
7: '10016',
8: '10018',
9: '10019'},
'latitude': {0: '31.215379379000467',
1: '34.22133455500045',
2: '34.795039606000444',
3: '31.292159523000464',
4: '31.69311635000048',
5: '33.595265517000485',
6: '34.44060759100046',
7: '33.254429322000476',
8: '33.50314015000049',
9: '34.74643089500046'},
'longitude': {0: ' -85.36146587999968',
1: ' -86.15937514799964',
2: ' -87.68507485299966',
3: ' -86.25539902199966',
4: ' -86.26549483099967',
5: ' -86.66531866799966',
6: ' -85.75726760699968',
7: ' -86.81407933399964',
8: ' -86.80242858299965',
9: ' -87.69893502799965'}}
df = pd.DataFrame(d)
My goal is to use the haversine function to figure out the distances between every item in KM:
from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
# 6367 km is the radius of the Earth
km = 6367 * c
return km
My goal is to get a dataframe that looks like the result_df below where the values are the distance between each provider id:
result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])
I can do this in a loop, however it's terribly slow. I'm looking for some help in converting this to a vectorized method:
for first_hospital_coordinates in result_df.columns:
for second_hospital_coordinates in result_df['Provider ID']:
if first_hospital_coordinates == 'Provider ID':
pass
else:
L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values
O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values
L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values
O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values
distance = haversine(O1, L1, O2, L2)
crit = result_df['Provider ID'] == second_hospital_coordinates
result_df.loc[crit, first_hospital_coordinates] = distance
To vectorize this code, you will need to operate on complete dataframe and not on the individual lats and longs. I have made an attempt at this. I need the result df and a new function h2,
import numpy as np
def h2(df, p):
inrad = df.applymap(radians)
dlon = inrad.longitude-inrad.longitude[p]
dlat = inrad.latitude-inrad.latitude[p]
lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))])
a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
c = 2 * 1/np.sin(np.sqrt(a))
km = 6367 * c
return km
df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index = df.index, columns = df.index)
for c in df2.columns:
df2[c] = h2(df, c)
print (df2)
This should yield, (I can't be sure if I have the correct answer... my goal was to vectorize the code)
Provider ID 10001 10005 10006 10007 \
Provider ID
10001 inf 5.021936e+05 5.270062e+05 1.649088e+06
10005 5.021936e+05 inf 9.294868e+05 4.985233e+05
10006 5.270062e+05 9.294868e+05 inf 4.548412e+05
10007 1.649088e+06 4.985233e+05 4.548412e+05 inf
10008 1.460299e+06 5.777248e+05 5.246954e+05 3.638231e+06
10011 6.723581e+05 2.004199e+06 1.027439e+06 6.394402e+05
10012 4.559090e+05 3.265536e+06 7.573411e+05 4.694125e+05
10016 7.680036e+05 1.429573e+06 9.105474e+05 7.517467e+05
10018 7.096548e+05 1.733554e+06 1.020976e+06 6.701920e+05
10019 5.436342e+05 9.278739e+05 2.891822e+07 4.638858e+05
Provider ID 10008 10011 10012 10016 \
Provider ID
10001 1.460299e+06 6.723581e+05 4.559090e+05 7.680036e+05
10005 5.777248e+05 2.004199e+06 3.265536e+06 1.429573e+06
10006 5.246954e+05 1.027439e+06 7.573411e+05 9.105474e+05
10007 3.638231e+06 6.394402e+05 4.694125e+05 7.517467e+05
10008 inf 7.766998e+05 5.401081e+05 9.496953e+05
10011 7.766998e+05 inf 1.341775e+06 4.220911e+06
10012 5.401081e+05 1.341775e+06 inf 1.119063e+06
10016 9.496953e+05 4.220911e+06 1.119063e+06 inf
10018 8.236437e+05 1.242451e+07 1.226941e+06 5.866259e+06
10019 5.372119e+05 1.051748e+06 7.514774e+05 9.362341e+05
Provider ID 10018 10019
Provider ID
10001 7.096548e+05 5.436342e+05
10005 1.733554e+06 9.278739e+05
10006 1.020976e+06 2.891822e+07
10007 6.701920e+05 4.638858e+05
10008 8.236437e+05 5.372119e+05
10011 1.242451e+07 1.051748e+06
10012 1.226941e+06 7.514774e+05
10016 5.866259e+06 9.362341e+05
10018 inf 1.048895e+06
10019 1.048895e+06 inf
[10 rows x 10 columns]
You don't need anything fancy, just a few mods to your function.
First, don't use the math
library. If you're doing real math or science, you're likely better off with numpy.
Second, we're going to use the dataframe method apply
. What apply
does is it takes a function and runs every row (axis=1) or column (axis=0) through it, and builds a new pandas object with all of the returned values. So we need to set up haversine
totake row of a dataframe and unpack the values. It becomes:
def haversine(row):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
import numpy as np
# convert all of the row to radians
row = np.radians(row)
# unpack the values for convenience
lat1 = row['lat1']
lat2 = row['lat2']
lon1 = row['lon1']
lon2 = row['lon2']
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
# 6367 km is the radius of the Earth
km = 6367 * c
return km
Ok, now we need to get your dataframe in shape. In your question, everything is a string and that's not good for doing math. So using your variable d
, I said:
df = pandas.DataFrame(d).set_index('Provider ID').astype(float)
So that created the dataframe of strings, set the provider as the index, and then converted all of the columns to floats, since we're doing math.
Now we need to make rows with two sets of coords. For that we'll use the shift
method and join the result to the original dataframe. Doing that all at once looks like this:
df = df.join(df.shift(), lsuffix='1', rsuffix='2')
print(df.head())
lat1 lon1 lat2 lon2
Provider ID
10001 31.215379 -85.361466 NaN NaN
10005 34.221335 -86.159375 31.215379 -85.361466
10006 34.795040 -87.685075 34.221335 -86.159375
10007 31.292160 -86.255399 34.795040 -87.685075
10008 31.693116 -86.265495 31.292160 -86.255399
rsuffix
and lsuffix
are what append "1" and "2" to the column names during the join operation.
The "2" columns are from df.shift()
and you'll notice that they are equal to the "1" columns of the previous row. Also you'll see that the first row of the "2" columns are NaN
since there's nothing previous to the first row.
So now we can apply
the Haversine function:
distance = df.apply(haversine, axis=1)
print(distance)
Provider ID
10001 NaN
10005 342.261590
10006 153.567591
10007 411.393751
10008 44.566642
10011 214.661170
10012 125.775583
10016 163.973219
10018 27.659157
10019 160.901128
dtype: float64
You should be able to operate on the whole thing. Im not very familiar with Pandas so I'll just work with the underlying numpy
arrays. Using your data d
:
df = pd.DataFrame(d)
df1 = df.astype(float)
a = np.radians(df1.values[:,1:])
# a.shape is 10,2, it contains the Lat/Lon only
# transpose and subtract
# add a new axes so they can be broadcast
diff = a[...,np.newaxis] - a.T
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]
b = np.square(np.sin(diff / 2))
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]
# make this term: cos(Lat1) * cos(Lat2)
cos_Lat = np.cos(a[:,0])
c = cos_Lat * cos_Lat[:, np.newaxis] # shape 10x10
# sin^2(dLon/2) is b[:,1,:]
b[:,1,:] = b[:,1,:] * c
g = b.sum(axis = 1)
h = 6367000 * 2 * np.arcsin((np.sqrt(g))) # meters
Back to a pandas.DataFrame
df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)
I didn't try any performance tests. There is a lot of intermediate array creation going on and it can be expensive - using the optional output argument of the ufuncs
might alleviate that.
Same thing with in-place operations:
df = pd.DataFrame(d)
df_A = df.astype(float)
z = df_A.values[:,1:]
# cos(Lat1) * cos(Lat2)
w = np.cos(z[:,0])
w = w * w[:, np.newaxis] # w.shape is (10,10)
# sin^2(dLat/2) and sin^2(dLon/2)
np.radians(z, z)
z = z[...,np.newaxis] - z.T
np.divide(z, 2, z)
np.sin(z, z)
np.square(z,z)
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]
# cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
np.multiply(z[:,1,:], w, z[:,1,:])
# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
z = z.sum(axis = 1)
np.sqrt(z, z)
np.arcsin(z,z)
np.multiply(z, 6367000 * 2, z) #meters
df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With