Find the nearest point in distance for all the points in the dataset - Python

Tags:

I have a dataset as follows,

Id     Latitude      longitude
1      25.42         55.47
2      25.39         55.47
3      24.48         54.38
4      24.51         54.54

I want to find the nearest distance for every point for the dataset. I found the following distance function in the internet,

from math import radians, cos, sin, asin, sqrt
def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    km = 6367 * c
    return km

I am using the following function,

shortest_distance = []
for i in range(1,len(data)):
    distance1 = []
    for j in range(1,len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i], data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))

But this code is looping twice for each entry and return n^2 iterations and in turn it is very slow. My dataset contains nearly 1 million records and each time looping through all the elements twice is becoming very costly.

I want to find the better way to find out the nearest point for each row. Can anybody help me in finding a way to solve this in python ?

Thanks

521

asked Jul 16 '17 09:07

haimen

1 Answers

The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point. In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)). There is also the additional one-time cost of building the KD-tree, which requires O(N) time.

If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)). Thus, for large enough N, the KD-tree will beat the brute force method.

See here for more on nearest neighbor algorithms (including KD-tree).

Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree.

scipy.spatial.kdtree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclength (given the radius of the sphere). So the idea is to convert the latitude/longitude data into cartesian coordinates, use a KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.

Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.

In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop

In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop

In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop

For N = 10000:

In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop

In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop

In [7]: %timeit orig(data)
# untested; too slow

Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by which using_kdtree is faster than orig will grow as N, the length of data, grows.

import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt

R = 6367

def using_kdtree(data):
    "Based on https://stackoverflow.com/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        https://en.wikipedia.org/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R)) 
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y','z']])
    distance, index = tree.query(data[['x', 'y','z']], k=2)
    return dist_to_arclength(distance[:, 1])

def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points 
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians 
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula 
        dlon = lon2 - lon1 
        dlat = lat2 - lat1 
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a)) 
        km = R * c
        return km

    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j: continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i], 
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance


def using_sklearn(data):
    """
    Based on https://stackoverflow.com/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km
    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result

np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude':np.random.uniform(-90,90,size=N), 
                     'Longitude':np.random.uniform(0,360,size=N)})

expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)

147

answered Oct 17 '22 16:10

unutbu

Related questions
                            
                                Fastest way to find Indexes of item in list?
                            
                                How to filter a Spark dataframe by a boolean column?
                            
                                How to use Keras' multi layer perceptron for multi-class classification
                            
                                How to remove dates from a list in Python
                            
                                Can you have required keyword arguments in Javascript or Python?
                            
                                Speedup GPU vs CPU for matrix operations
                            
                                Pywinauto: unable to bring window to foreground
                            
                                ImportError: No module named 'queue' while running my app freezed with cx_freeze
                            
                                How to parse binary string to dict ?
                            
                                pip and pip3 - both pointing to python3.5?
                            
                                import my database connection with python
                            
                                append items from shuffled list to a new list
                            
                                How to calculate the coordinates of the line between two points in python?
                            
                                Pandas - group by id and drop duplicate with threshold
                            
                                Change directory on server before uploading files with ftplib in Python
                            
                                Python: inverse empirical cumulative distribution function (ECDF)?
                            
                                In Boto3, how to create a Paginator for list_objects with additional keyword arguments?
                            
                                Graph networkx from python dict
                            
                                python base64 to hex
                            
                                How to list the files in S3 subdirectory using Python

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Find the nearest point in distance for all the points in the dataset - Python

Tags:

python

python-3.x

python-2.7

haimen

People also ask

1 Answers

unutbu

Recent Activity

Donate For Us