Pairwise haversine distance calculation

Tags:

I have two arrays with lat and long. I want to calculate distance between every pair of lat and long with every other pair of lat and long in the array. Here are my two arrays.

Click to copy

lat_array

array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132,
        0.33370132,  0.33370132,  0.33371075,  0.33371075,  0.33370132,
        0.33370132,  0.33370132,  0.33356488,  0.33356488,  0.33370132,
        0.33370132,  0.33370132,  0.33401788,  0.33362632,  0.33362632,
        0.33364007,  0.33370132,  0.33401788,  0.33401788,  0.33358399,
        0.33358399,  0.33358399,  0.33370132,  0.33370132,  0.33362632,
        0.33370132,  0.33370132,  0.33370132,  0.33370132,  0.33370132,
        0.33356488,  0.33356456,  0.33391071,  0.33370132,  0.33356488,
        0.33356488,  0.33356456,  0.33356456,  0.33356456,  0.33362632,
        0.33364804,  0.3336314 ,  0.33370132,  0.33370132,  0.33370132,
        0.33364034,  0.33359921,  0.33370132,  0.33360397,  0.33348863,
        0.33370132])
long_array

array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27246931,  1.27246931,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27254305,  1.27254305,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27259085,  1.27250461,  1.27250461,
        1.27251211,  1.2724337 ,  1.27259085,  1.27259085,  1.27252134,
        1.27252134,  1.27252134,  1.2724337 ,  1.2724337 ,  1.27250461,
        1.2724337 ,  1.2724337 ,  1.2724337 ,  1.2724337 ,  1.2724337 ,
        1.27254305,  1.27253229,  1.27266808,  1.2724337 ,  1.27254305,
        1.27254305,  1.27253229,  1.27253229,  1.27253229,  1.27250461,
        1.27250534,  1.27250184,  1.2724337 ,  1.2724337 ,  1.2724337 ,
        1.27251339,  1.27223739,  1.2724337 ,  1.2722575 ,  1.27237575,
        1.2724337 ])

After conversion into radians. Now I want distance between first pair of lat and long with remaining pairs of lat and long and so on. And want to print the pairs and the corresponding distance.

This is what I am doing in python.

Click to copy

distance = []
R = 6371.0

for i in range(len(lat_array)):
   for j in (i+1,len(lat_array)):
      dlon = long_array[j]-long_array[i]
      dlat = lat_array[j]-lat_array[i]
      a = sin(dlat / 2)**2 + cos(lat_array[i]) * cos(lat_array[j]) *     
          sin(dlon / 2)**2
      c = 2 * atan2(sqrt(a), sqrt(1 - a))

      distance.append(R * c)

It gives me an error IndexError: index 56 is out of bounds for axis 0 with size 56 Where I am doing it wrong? And how to make calculation faster if the array is big? Please help.

623

asked Jan 01 '16 16:01

Neil

2 Answers

Since this is currently Google's top result for "pairwise haversine distance" I'll add my two cents: This problem can be solved very quickly if you have access to scikit-learn. When looking at sklearn.metrics.pairwise_distances you'll note that the 'haversine' metric is not supported, however it is implemented in sklearn.neighbors.DistanceMetric.

This means you can do the following:

Click to copy

from sklearn.neighbors import DistanceMetric

def sklearn_haversine(lat, lon):
    haversine = DistanceMetric.get_metric('haversine')
    latlon = np.hstack((lat[:, np.newaxis], lon[:, np.newaxis]))
    dists = haversine.pairwise(latlon)
    return 6371 * dists

Note that the concatenation of lat and lon is only necessary since they are seperate arrays. If you would pass them as a combined array of shape (n_samples, 2) you could call haversine.pairwise directly on them. Furthermore multiplication by 6371 is also only necessary if you need the distances in kilometers. E.g. this step wouldn't be necessary if you wanted to simply find the closest pair of points.

Verification:

Click to copy

In [87]: lat = np.array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])

In [88]: lng = np.array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])

In [89]: sklearn_haversine(lat, lng)
Out[89]:
array([[ 0.        ,  0.25227021,  0.25227021,  2.90953323,  1.05422047],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 2.90953323,  3.00383463,  3.00383463,  0.        ,  2.2276139 ],
       [ 1.05422047,  0.98975923,  0.98975923,  2.2276139 ,  0.        ]])

Performance:

Click to copy

In [91]: lat = np.random.randn(1000)

In [92]: lng = np.random.randn(1000)

In [93]: %timeit original_app(lat,lng)
1 loops, best of 3: 1.46 s per loop

In [94]: %timeit vectorized_app1(lat,lng)
10 loops, best of 3: 86.7 ms per loop

In [95]: %timeit vectorized_app2(lat,lng)
10 loops, best of 3: 75.7 ms per loop

In [96]: %timeit sklearn_haversine(lat,lng)
10 loops, best of 3: 76 ms per loop

In conclusion you can get the output of Divakar's vectorized_app1 with the speed of vectorized_app2 in shorter and simpler code.

answered Sep 18 '22 16:09

jdoerrie

Assuming lat and lng as the lattitudes & longitudes arrays and that those have data in radians, here's one vectorized solution based upon this other solution -

Click to copy

# Elementwise differentiations for lattitudes & longitudes
dflat = lat[:,None] - lat
dflng = lng[:,None] - lng

# Finally Calculate haversine using its distance formula
d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))

Now, the above approach would give us output for all pairs irrespective of their order. Thus, we would have two distance outputs for the two pairs :(point1,point2) & (point2,point1), even though the distances would be the same. So, to save on memory and hopefully better performance, you can create unique paired IDs with np.triu_indices and modify the earlier listed approach, like so -

Click to copy

# Elementwise differentiations for lattitudes & longitudes, 
# but not repeat for the same paired elements
N = lat.size
idx1,idx2 = np.triu_indices(N,1)
dflat = lat[idx2] - lat[idx1]
dflng = lng[idx2] - lng[idx1]

# Finally Calculate haversine using its distance formula
d = np.sin(dflat/2)**2 + np.cos(lat[idx2])*np.cos(lat[idx1]) * np.sin(dflng/2)**2
hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))

Function definitions -

Click to copy

def original_app(lat,lng):
    distance = []
    R = 6371.0
    for i in range(len(lat)):
       for j in range(i+1,len(lat)):
          dlon = lng[j]-lng[i]
          dlat = lat[j]-lat[i]
          a = sin(dlat / 2)**2 + cos(lat[i]) * cos(lat[j]) * sin(dlon / 2)**2
          c = 2 * atan2(sqrt(a), sqrt(1 - a))
          distance.append(R * c)
    return distance

def vectorized_app1(lat,lng):                               
    dflat = lat[:,None] - lat
    dflng = lng[:,None] - lng
    d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

def vectorized_app2(lat,lng):                               
    N = lat.size
    idx1,idx2 = np.triu_indices(N,1)
    dflat = lat[idx2] - lat[idx1]
    dflng = lng[idx2] - lng[idx1]
    d =np.sin(dflat/2)**2+np.cos(lat[idx2])*np.cos(lat[idx1])*np.sin(dflng/2)**2
    return  2 * 6371 * np.arcsin(np.sqrt(d))

Verify output -

Click to copy

In [78]: lat
Out[78]: array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])

In [79]: lng
Out[79]: array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])

In [80]: original_app(lat,lng)
Out[80]: 
[0.2522702110418014,
 0.2522702110418014,
 2.909533226553249,
 1.0542204712876762,
 0.0,
 3.003834632906676,
 0.9897592295963831,
 3.003834632906676,
 0.9897592295963831,
 2.2276138997714474]

In [81]: vectorized_app1(lat,lng)
Out[81]: 
array([[ 0.        ,  0.25227021,  0.25227021,  2.90953323,  1.05422047],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 2.90953323,  3.00383463,  3.00383463,  0.        ,  2.2276139 ],
       [ 1.05422047,  0.98975923,  0.98975923,  2.2276139 ,  0.        ]])

In [82]: vectorized_app2(lat,lng)
Out[82]: 
array([ 0.25227021,  0.25227021,  2.90953323,  1.05422047,  0.        ,
        3.00383463,  0.98975923,  3.00383463,  0.98975923,  2.2276139 ])

Runtime test -

Click to copy

In [83]: lat = np.random.randn(1000)

In [84]: lng = np.random.randn(1000)

In [85]: %timeit original_app(lat,lng)
1 loops, best of 3: 2.11 s per loop

In [86]: %timeit vectorized_app1(lat,lng)
1 loops, best of 3: 263 ms per loop

In [87]: %timeit vectorized_app2(lat,lng)
1 loops, best of 3: 224 ms per loop

Thus, for performance it seems vectorized_app2 might be the way to go!

answered Sep 19 '22 16:09

Divakar

Related questions
                            
                                Changing time from epoch time to iso format in Python [duplicate]
                            
                                How do I inform PyCharm of the location of custom modules?
                            
                                How do I make a space between words when writing to a text file in python
                            
                                Log stack trace for python warning
                            
                                Finding location in code for numpy RuntimeWarning
                            
                                Django can't find template directory
                            
                                python list comprehension (if, continue, break)
                            
                                Django "TemplateDoesNotExist " Error but "Using loader django.template.loaders.app_directories.Loader" File Exists
                            
                                How to correctly break a long line in Python?
                            
                                Python Ranking Dictionary Return Rank
                            
                                using a conditional and lambda in map
                            
                                Python Pandas: Convert nested dictionary to dataframe
                            
                                Convert Two column data frame to occurrence matrix in pandas
                            
                                Change Icon For Tkinter Messagebox
                            
                                Why there is need to push django migrations to version control system
                            
                                Python share values
                            
                                Using OpenWeatherMap API gives 401 error
                            
                                How to sum negative and positive values separately when using groupby in pandas?
                            
                                Python PIL - Finding Nearest Color (Rounding Colors)
                            
                                Finding highest values in each row in a data frame for python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pairwise haversine distance calculation

Tags:

performance

python

arrays

haversine

numpy

Neil

People also ask

2 Answers

jdoerrie

Divakar

Recent Activity

Donate For Us