Vectorizing a function in pandas

Question

I have a dataframe that contains a list of lat/lon coordinates:

d = {'Provider ID': {0: '10001',
  1: '10005',
  2: '10006',
  3: '10007',
  4: '10008',
  5: '10011',
  6: '10012',
  7: '10016',
  8: '10018',
  9: '10019'},
 'latitude': {0: '31.215379379000467',
  1: '34.22133455500045',
  2: '34.795039606000444',
  3: '31.292159523000464',
  4: '31.69311635000048',
  5: '33.595265517000485',
  6: '34.44060759100046',
  7: '33.254429322000476',
  8: '33.50314015000049',
  9: '34.74643089500046'},
 'longitude': {0: ' -85.36146587999968',
  1: ' -86.15937514799964',
  2: ' -87.68507485299966',
  3: ' -86.25539902199966',
  4: ' -86.26549483099967',
  5: ' -86.66531866799966',
  6: ' -85.75726760699968',
  7: ' -86.81407933399964',
  8: ' -86.80242858299965',
  9: ' -87.69893502799965'}}
df = pd.DataFrame(d)

My goal is to use the haversine function to figure out the distances between every item in KM:

from math import radians, cos, sin, asin, sqrt
def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """

    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km

My goal is to get a dataframe that looks like the result_df below where the values are the distance between each provider id:

 result_df = pd.DataFrame(columns = df['Provider ID'], index=df['Provider ID'])

I can do this in a loop, however it's terribly slow. I'm looking for some help in converting this to a vectorized method:

for first_hospital_coordinates in result_df.columns:
    for second_hospital_coordinates in result_df['Provider ID']:
        if first_hospital_coordinates == 'Provider ID':
            pass
        else:
            L1 = df[df['Provider ID'] == first_hospital_coordinates]['latitude'].astype('float64').values
            O1 = df[df['Provider ID'] == first_hospital_coordinates]['longitude'].astype('float64').values
            L2 = df[df['Provider ID'] == second_hospital_coordinates]['latitude'].astype('float64').values
            O2 = df[df['Provider ID'] == second_hospital_coordinates]['longitude'].astype('float64').values

            distance = haversine(O1, L1, O2, L2)

            crit = result_df['Provider ID'] == second_hospital_coordinates
            result_df.loc[crit, first_hospital_coordinates] = distance

nitin · Accepted Answer

To vectorize this code, you will need to operate on complete dataframe and not on the individual lats and longs. I have made an attempt at this. I need the result df and a new function h2,

import numpy as np
def h2(df, p):
    inrad = df.applymap(radians)
    dlon = inrad.longitude-inrad.longitude[p]
    dlat = inrad.latitude-inrad.latitude[p]
    lat1 = pd.Series(index = df.index, data = [df.latitude[p] for i in range(len(df.index))])
    a = np.sin(dlat/2)*np.sin(dlat/2) + np.cos(df.latitude) * np.cos(lat1) * np.sin(dlon/2)**2
    c = 2 * 1/np.sin(np.sqrt(a))
    km = 6367 * c
    return km

df = df.set_index('Provider ID')
df = df.astype(float)
df2 = pd.DataFrame(index = df.index, columns = df.index)
for c in df2.columns:
    df2[c] = h2(df, c)

print (df2)

This should yield, (I can't be sure if I have the correct answer... my goal was to vectorize the code)

Provider ID         10001         10005         10006         10007  \
Provider ID                                                           
10001                 inf  5.021936e+05  5.270062e+05  1.649088e+06   
10005        5.021936e+05           inf  9.294868e+05  4.985233e+05   
10006        5.270062e+05  9.294868e+05           inf  4.548412e+05   
10007        1.649088e+06  4.985233e+05  4.548412e+05           inf   
10008        1.460299e+06  5.777248e+05  5.246954e+05  3.638231e+06   
10011        6.723581e+05  2.004199e+06  1.027439e+06  6.394402e+05   
10012        4.559090e+05  3.265536e+06  7.573411e+05  4.694125e+05   
10016        7.680036e+05  1.429573e+06  9.105474e+05  7.517467e+05   
10018        7.096548e+05  1.733554e+06  1.020976e+06  6.701920e+05   
10019        5.436342e+05  9.278739e+05  2.891822e+07  4.638858e+05   

Provider ID         10008         10011         10012         10016  \
Provider ID                                                           
10001        1.460299e+06  6.723581e+05  4.559090e+05  7.680036e+05   
10005        5.777248e+05  2.004199e+06  3.265536e+06  1.429573e+06   
10006        5.246954e+05  1.027439e+06  7.573411e+05  9.105474e+05   
10007        3.638231e+06  6.394402e+05  4.694125e+05  7.517467e+05   
10008                 inf  7.766998e+05  5.401081e+05  9.496953e+05   
10011        7.766998e+05           inf  1.341775e+06  4.220911e+06   
10012        5.401081e+05  1.341775e+06           inf  1.119063e+06   
10016        9.496953e+05  4.220911e+06  1.119063e+06           inf   
10018        8.236437e+05  1.242451e+07  1.226941e+06  5.866259e+06   
10019        5.372119e+05  1.051748e+06  7.514774e+05  9.362341e+05   

Provider ID         10018         10019  
Provider ID                              
10001        7.096548e+05  5.436342e+05  
10005        1.733554e+06  9.278739e+05  
10006        1.020976e+06  2.891822e+07  
10007        6.701920e+05  4.638858e+05  
10008        8.236437e+05  5.372119e+05  
10011        1.242451e+07  1.051748e+06  
10012        1.226941e+06  7.514774e+05  
10016        5.866259e+06  9.362341e+05  
10018                 inf  1.048895e+06  
10019        1.048895e+06           inf  

[10 rows x 10 columns]

Paul H · Answer

You don't need anything fancy, just a few mods to your function.

First, don't use the math library. If you're doing real math or science, you're likely better off with numpy.

Second, we're going to use the dataframe method apply. What apply does is it takes a function and runs every row (axis=1) or column (axis=0) through it, and builds a new pandas object with all of the returned values. So we need to set up haversine totake row of a dataframe and unpack the values. It becomes:

def haversine(row):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    import numpy as np

    # convert all of the row to radians
    row = np.radians(row)

    # unpack the values for convenience
    lat1 = row['lat1']
    lat2 = row['lat2']
    lon1 = row['lon1']
    lon2 = row['lon2']

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 

    # 6367 km is the radius of the Earth
    km = 6367 * c
    return km

Ok, now we need to get your dataframe in shape. In your question, everything is a string and that's not good for doing math. So using your variable d, I said:

df = pandas.DataFrame(d).set_index('Provider ID').astype(float)

So that created the dataframe of strings, set the provider as the index, and then converted all of the columns to floats, since we're doing math.

Now we need to make rows with two sets of coords. For that we'll use the shift method and join the result to the original dataframe. Doing that all at once looks like this:

df = df.join(df.shift(), lsuffix='1', rsuffix='2')
print(df.head())

                  lat1       lon1       lat2       lon2
Provider ID                                            
10001        31.215379 -85.361466        NaN        NaN
10005        34.221335 -86.159375  31.215379 -85.361466
10006        34.795040 -87.685075  34.221335 -86.159375
10007        31.292160 -86.255399  34.795040 -87.685075
10008        31.693116 -86.265495  31.292160 -86.255399

rsuffix and lsuffix are what append "1" and "2" to the column names during the join operation.

The "2" columns are from df.shift() and you'll notice that they are equal to the "1" columns of the previous row. Also you'll see that the first row of the "2" columns are NaN since there's nothing previous to the first row.

So now we can apply the Haversine function:

distance = df.apply(haversine, axis=1)
print(distance)
Provider ID
10001                 NaN
10005          342.261590
10006          153.567591
10007          411.393751
10008           44.566642
10011          214.661170
10012          125.775583
10016          163.973219
10018           27.659157
10019          160.901128
dtype: float64

wwii · Answer

You should be able to operate on the whole thing. Im not very familiar with Pandas so I'll just work with the underlying numpy arrays. Using your data d:

df = pd.DataFrame(d)
df1 = df.astype(float)
a = np.radians(df1.values[:,1:])
# a.shape is 10,2, it contains the Lat/Lon only

# transpose and subtract
# add a new axes so they can be broadcast
diff = a[...,np.newaxis] - a.T
# diff.shape is (10,2,10): dLat is diff[:,0,:], dLon is diff[:,1,:]

b = np.square(np.sin(diff / 2))
# b.shape is (10,2,10): sin^2(dLat/2) is b[:,0,:], sin^2(dLon/2) is b[:,1,:]

# make this term: cos(Lat1) * cos(Lat2)
cos_Lat = np.cos(a[:,0])
c = cos_Lat * cos_Lat[:, np.newaxis]    # shape 10x10

# sin^2(dLon/2) is b[:,1,:]
b[:,1,:] = b[:,1,:] * c
g = b.sum(axis = 1)
h = 6367000 * 2 * np.arcsin((np.sqrt(g)))   # meters

Back to a pandas.DataFrame

df2 = pd.DataFrame(h, index = df['Provider ID'].values, columns = df['Provider ID'].values)

I didn't try any performance tests. There is a lot of intermediate array creation going on and it can be expensive - using the optional output argument of the ufuncs might alleviate that.

Same thing with in-place operations:

df = pd.DataFrame(d)
df_A = df.astype(float)
z = df_A.values[:,1:]

# cos(Lat1) * cos(Lat2)
w = np.cos(z[:,0])
w = w * w[:, np.newaxis]    # w.shape is (10,10)

# sin^2(dLat/2) and sin^2(dLon/2)
np.radians(z, z)
z = z[...,np.newaxis] - z.T
np.divide(z, 2, z)
np.sin(z, z)
np.square(z,z)
# z.shape is now (10,2,10): sin^2(dLat/2) is z[:,0,:], sin^2(dLon/2) is z[:,1,:]

# cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
np.multiply(z[:,1,:], w, z[:,1,:])

# sin^2(dLat/2) + cos(Lat1) * cos(Lat2) * sin^2(dLon/2)
z = z.sum(axis = 1)

np.sqrt(z, z)
np.arcsin(z,z)
np.multiply(z, 6367000 * 2, z)   #meters

df_B = pd.DataFrame(z, index = df['Provider ID'].values, columns = df['Provider ID'].values)

Vectorizing a function in pandas

Tags:

python

pandas

dataframe

DataSwede

3 Answers

nitin

Paul H

wwii

Recent Activity

Donate For Us

Vectorizing a function in pandas

Tags:

python

pandas

dataframe

DataSwede

3 Answers

nitin

Paul H

wwii

Related questions

Recent Activity

Donate For Us