I have 55249 cities in my database. Every single one has got latitude longitude values. For every city I want to calculate distances to every other city and store those that are no further than 30km. Here is my algorithm:
# distance function
from math import sin, cos, sqrt, atan2, radians
def distance(obj1, obj2):
lat1 = radians(obj1.latitude)
lon1 = radians(obj1.longitude)
lat2 = radians(obj2.latitude)
lon2 = radians(obj2.longitude)
dlon = lon2 - lon1
dlat = lat2 - lat1
a = (sin(dlat/2))**2 + cos(lat1) * cos(lat2) * (sin(dlon/2))**2
c = 2 * atan2(sqrt(a), sqrt(1-a))
return round(6373.0 * c, 2)
def distances():
cities = City.objects.all() # I am using Django ORM
for city in cities:
closest = list()
for tested_city in cities:
distance = distance(city, tested_city)
if distance <= 30. and distance != 0.:
closest.append(tested_city)
city.closest_cities.add(*closest) # again, Django thing
city.save() # Django
This works but takes awful lot of time. Gonna take weeks to complete. Any way I could speed it up?
You can't afford to compute the distance between every pair of cities. Instead, you need to put your cities in a space-partitioning data structure for which you can make fast nearest-neighbour queries. SciPy comes with a kd-tree implementation, scipy.spatial.KDTree
, that is suitable for this application.
There are two difficulties here. First, scipy.spatial.KDTree
uses Euclidean distance between points, but you want to use the great circle distance along the surface of the Earth. Second, longitude wraps around, so that nearest neighbours might have longitudes that differ by 360°. Both problems can be solved if you take the following approach:
Convert your locations from geodetic coordinates (latitude, longitude) to ECEF (Earth-Centered, Earth-Fixed) coordinates (x, y, z).
Put these ECEF coordinates into the scipy.spatial.KDTree
.
Convert your great circle distance (for example, 30 km) into a Euclidean distance.
Call scipy.spatial.KDTree.query_ball_point
to get the cities within range.
Here's some example code to illustrate this approach. The function geodetic2ecef
comes from PySatel by David Parunakian and is licensed under the GPL.
from math import radians, cos, sin, sqrt
# Constants defined by the World Geodetic System 1984 (WGS84)
A = 6378.137
B = 6356.7523142
ESQ = 6.69437999014 * 0.001
def geodetic2ecef(lat, lon, alt=0):
"""Convert geodetic coordinates to ECEF."""
lat, lon = radians(lat), radians(lon)
xi = sqrt(1 - ESQ * sin(lat))
x = (A / xi + alt) * cos(lat) * cos(lon)
y = (A / xi + alt) * cos(lat) * sin(lon)
z = (A / xi * (1 - ESQ) + alt) * sin(lat)
return x, y, z
def euclidean_distance(distance):
"""Return the approximate Euclidean distance corresponding to the
given great circle distance (in km).
"""
return 2 * A * sin(distance / (2 * B))
Let's make up fifty thousand random city locations and convert them to ECEF coordinates:
>>> from random import uniform
>>> cities = [(uniform(-90, 90), uniform(0, 360)) for _ in range(50000)]
>>> ecef_cities = [geodetic2ecef(lat, lon) for lat, lon in cities]
Put them into a scipy.spatial.KDTree
:
>>> import numpy
>>> from scipy.spatial import KDTree
>>> tree = KDTree(numpy.array(ecef_cities))
Find all cities within about 100 km of London:
>>> london = geodetic2ecef(51, 0)
>>> tree.query_ball_point([london], r=euclidean_distance(100))
array([[37810, 15755, 16276]], dtype=object)
This array contains, for each point that you queried, an array the cities within the distance r
. Each neighbour is given as its index in the original array that you passed to KDTree
. So there are three cities within about 100 km of London, namely the cities with indexes 37810, 15755, and 16276 in the original list:
>>> from pprint import pprint
>>> pprint([cities[i] for i in [37810, 15755, 16276]])
[(51.7186871990946, 359.8043453670437),
(50.82734317063884, 1.1422052710187103),
(50.95466110717763, 0.8956257749604779)]
Notes:
You can see from the example output that neighbours with longitudes that differ by about 360° are correctly discovered.
The approach seems fast enough. Here we find neighbours within 30 km for the first thousand cities, taking about 5 seconds:
>>> from timeit import timeit
>>> timeit(lambda:tree.query_ball_point(ecef_cities[:1000], r=euclidean_distance(30)), number=1)
5.013611573027447
Extrapolating, we expect to find neighbours within 30 km for all 50,000 cities in about four minutes.
My euclidean_distance
function overestimates the Euclidean distance corresponding to a given great circle distance (so as not to miss any cities). This might be good enough for some applications—after all, cities are not point objects—but if you need more accuracy than this, then you could filter the resulting points using, say, one of the great circle distance functions from geopy.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With