I have a dataset as follows,
Id Latitude longitude
1 25.42 55.47
2 25.39 55.47
3 24.48 54.38
4 24.51 54.54
I want to find the nearest distance for every point for the dataset. I found the following distance function in the internet,
from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
km = 6367 * c
return km
I am using the following function,
shortest_distance = []
for i in range(1,len(data)):
distance1 = []
for j in range(1,len(data)):
distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
But this code is looping twice for each entry and return n^2 iterations and in turn it is very slow. My dataset contains nearly 1 million records and each time looping through all the elements twice is becoming very costly.
I want to find the better way to find out the nearest point for each row. Can anybody help me in finding a way to solve this in python ?
Thanks
The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point. In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)) .
The idea is to split the point set into two halves with a vertical line, find the closest pair within each half, and then find the closest pair between the two halves. The result of finding the closest pair within each half speeds up finding the closest pair between the halves.
The math. dist() method returns the Euclidean distance between two points (p and q), where p and q are the coordinates of that point. Note: The two points (p and q) must be of the same dimensions.
dist() method in Python is used to the Euclidean distance between two points p and q, each given as a sequence (or iterable) of coordinates. The two points must have the same dimension. This method is new in Python version 3.8. Returns: the calculated Euclidean distance between the given points.
The brute force method of finding the nearest of N
points to a given point is O(N)
-- you'd have to check each point.
In contrast, if the N
points are stored in a KD-tree, then finding the nearest point is on average O(log(N))
.
There is also the additional one-time cost of building the KD-tree, which requires O(N)
time.
If you need to repeat this process N
times, then the brute force method is O(N**2)
and the kd-tree method is O(N*log(N))
.
Thus, for large enough N
, the KD-tree will beat the brute force method.
See here for more on nearest neighbor algorithms (including KD-tree).
Below (in the function using_kdtree
) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree
.
scipy.spatial.kdtree
uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere).
So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree
to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.
Here are some benchmarks. Using N = 100
, using_kdtree
is 39x faster than the orig
(brute force) method.
In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop
In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop
In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop
For N = 10000
:
In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop
In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop
In [7]: %timeit orig(data)
# untested; too slow
Since using_kdtree
is O(N log(N))
and orig
is O(N**2)
, the factor by
which using_kdtree
is faster than orig
will grow as N
, the length of
data
, grows.
import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt
R = 6367
def using_kdtree(data):
"Based on https://stackoverflow.com/q/43020919/190597"
def dist_to_arclength(chord_length):
"""
https://en.wikipedia.org/wiki/Great-circle_distance
Convert Euclidean chord length to great circle arc length
"""
central_angle = 2*np.arcsin(chord_length/(2.0*R))
arclength = R*central_angle
return arclength
phi = np.deg2rad(data['Latitude'])
theta = np.deg2rad(data['Longitude'])
data['x'] = R * np.cos(phi) * np.cos(theta)
data['y'] = R * np.cos(phi) * np.sin(theta)
data['z'] = R * np.sin(phi)
tree = spatial.KDTree(data[['x', 'y','z']])
distance, index = tree.query(data[['x', 'y','z']], k=2)
return dist_to_arclength(distance[:, 1])
def orig(data):
def distance(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
c = 2 * asin(sqrt(a))
km = R * c
return km
shortest_distance = []
for i in range(len(data)):
distance1 = []
for j in range(len(data)):
if i == j: continue
distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
data['Longitude'][j], data['Latitude'][j]))
shortest_distance.append(min(distance1))
return shortest_distance
def using_sklearn(data):
"""
Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
"""
def distance(p1, p2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
lon1, lat1 = p1
lon2, lat2 = p2
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
km = R * c
return km
points = data[['Longitude', 'Latitude']]
nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
distances, indices = nbrs.kneighbors(points)
result = distances[:, 1]
return result
np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N),
'Longitude':np.random.uniform(0,360,size=N)})
expected = orig(data)
for func in [using_kdtree, using_sklearn]:
result = func(data)
assert np.allclose(expected, result)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With