Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find the nearest point in distance for all the points in the dataset - Python

I have a dataset as follows,

Id     Latitude      longitude
1      25.42         55.47
2      25.39         55.47
3      24.48         54.38
4      24.51         54.54

I want to find the nearest distance for every point for the dataset. I found the following distance function in the internet,

from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km

I am using the following function,

shortest_distance = []
for i in range(1,len(data)):
    distance1 = []
    for j in range(1,len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))

But this code is looping twice for each entry and return n^2 iterations and in turn it is very slow. My dataset contains nearly 1 million records and each time looping through all the elements twice is becoming very costly.

I want to find the better way to find out the nearest point for each row. Can anybody help me in finding a way to solve this in python ?

Thanks

like image 521
haimen Avatar asked Jul 16 '17 09:07

haimen


People also ask

How do you find the nearest point in Python?

The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point. In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)) .

How do you find the closest point to a set of points?

The idea is to split the point set into two halves with a vertical line, find the closest pair within each half, and then find the closest pair between the two halves. The result of finding the closest pair within each half speeds up finding the closest pair between the halves.

How do you find the distance from one point to another in python?

The math. dist() method returns the Euclidean distance between two points (p and q), where p and q are the coordinates of that point. Note: The two points (p and q) must be of the same dimensions.

How do you find the distance between two vectors in Python?

dist() method in Python is used to the Euclidean distance between two points p and q, each given as a sequence (or iterable) of coordinates. The two points must have the same dimension. This method is new in Python version 3.8. Returns: the calculated Euclidean distance between the given points.


1 Answers

The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point. In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)). There is also the additional one-time cost of building the KD-tree, which requires O(N) time.

If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)). Thus, for large enough N, the KD-tree will beat the brute force method.

See here for more on nearest neighbor algorithms (including KD-tree).


Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree.

scipy.spatial.kdtree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere). So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.


Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.

In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop

In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop

In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop

For N = 10000:

In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop

In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop

In [7]: %timeit orig(data)
# untested; too slow

Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by which using_kdtree is faster than orig will grow as N, the length of data, grows.


import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])

def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a)) 
        km = R * c
        return km

    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j: continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i], 
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance


def using_sklearn(data):
    """
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km
    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result

np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 
                     'Longitude':np.random.uniform(0,360,size=N)})

expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)
like image 147
unutbu Avatar answered Oct 17 '22 16:10

unutbu