Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python - how to speed up calculation of distances between cities

I have 55249 cities in my database. Every single one has got latitude longitude values. For every city I want to calculate distances to every other city and store those that are no further than 30km. Here is my algorithm:

# distance function
from math import sin, cos, sqrt, atan2, radians

def distance(obj1, obj2):
    lat1 = radians(obj1.latitude)
    lon1 = radians(obj1.longitude)
    lat2 = radians(obj2.latitude)
    lon2 = radians(obj2.longitude)
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2))**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))
    return round(6373.0 * c, 2)

def distances():
    cities = City.objects.all()  # I am using Django ORM
    for city in cities:
        closest = list()
        for tested_city in cities:
            distance = distance(city, tested_city)
            if distance <= 30. and distance != 0.:
                closest.append(tested_city)
        city.closest_cities.add(*closest)  # again, Django thing
        city.save()  # Django

This works but takes awful lot of time. Gonna take weeks to complete. Any way I could speed it up?

like image 788
gwaramadze Avatar asked Dec 18 '13 10:12

gwaramadze


1 Answers

You can't afford to compute the distance between every pair of cities. Instead, you need to put your cities in a space-partitioning data structure for which you can make fast nearest-neighbour queries. SciPy comes with a kd-tree implementation, scipy.spatial.KDTree, that is suitable for this application.

There are two difficulties here. First, scipy.spatial.KDTree uses Euclidean distance between points, but you want to use the great circle distance along the surface of the Earth. Second, longitude wraps around, so that nearest neighbours might have longitudes that differ by 360°. Both problems can be solved if you take the following approach:

  1. Convert your locations from geodetic coordinates (latitude, longitude) to ECEF (Earth-Centered, Earth-Fixed) coordinates (x, y, z).

  2. Put these ECEF coordinates into the scipy.spatial.KDTree.

  3. Convert your great circle distance (for example, 30 km) into a Euclidean distance.

  4. Call scipy.spatial.KDTree.query_ball_point to get the cities within range.

Here's some example code to illustrate this approach. The function geodetic2ecef comes from PySatel by David Parunakian and is licensed under the GPL.

from math import radians, cos, sin, sqrt

# Constants defined by the World Geodetic System 1984 (WGS84)
A = 6378.137
B = 6356.7523142
ESQ = 6.69437999014 * 0.001

def geodetic2ecef(lat, lon, alt=0):
    """Convert geodetic coordinates to ECEF."""
    lat, lon = radians(lat), radians(lon)
    xi = sqrt(1 - ESQ * sin(lat))
    x = (A / xi + alt) * cos(lat) * cos(lon)
    y = (A / xi + alt) * cos(lat) * sin(lon)
    z = (A / xi * (1 - ESQ) + alt) * sin(lat)
    return x, y, z

def euclidean_distance(distance):
    """Return the approximate Euclidean distance corresponding to the
    given great circle distance (in km).

    """
    return 2 * A * sin(distance / (2 * B))

Let's make up fifty thousand random city locations and convert them to ECEF coordinates:

>>> from random import uniform
>>> cities = [(uniform(-90, 90), uniform(0, 360)) for _ in range(50000)]
>>> ecef_cities = [geodetic2ecef(lat, lon) for lat, lon in cities]

Put them into a scipy.spatial.KDTree:

>>> import numpy
>>> from scipy.spatial import KDTree
>>> tree = KDTree(numpy.array(ecef_cities))

Find all cities within about 100 km of London:

>>> london = geodetic2ecef(51, 0)
>>> tree.query_ball_point([london], r=euclidean_distance(100))
array([[37810, 15755, 16276]], dtype=object)

This array contains, for each point that you queried, an array the cities within the distance r. Each neighbour is given as its index in the original array that you passed to KDTree. So there are three cities within about 100 km of London, namely the cities with indexes 37810, 15755, and 16276 in the original list:

>>> from pprint import pprint
>>> pprint([cities[i] for i in [37810, 15755, 16276]])
[(51.7186871990946, 359.8043453670437),
 (50.82734317063884, 1.1422052710187103),
 (50.95466110717763, 0.8956257749604779)]

Notes:

  1. You can see from the example output that neighbours with longitudes that differ by about 360° are correctly discovered.

  2. The approach seems fast enough. Here we find neighbours within 30 km for the first thousand cities, taking about 5 seconds:

    >>> from timeit import timeit
    >>> timeit(lambda:tree.query_ball_point(ecef_cities[:1000], r=euclidean_distance(30)), number=1)
    5.013611573027447
    

    Extrapolating, we expect to find neighbours within 30 km for all 50,000 cities in about four minutes.

  3. My euclidean_distance function overestimates the Euclidean distance corresponding to a given great circle distance (so as not to miss any cities). This might be good enough for some applications—after all, cities are not point objects—but if you need more accuracy than this, then you could filter the resulting points using, say, one of the great circle distance functions from geopy.

like image 173
Gareth Rees Avatar answered Sep 28 '22 00:09

Gareth Rees