Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find nearest point in other dataframe (WITH A LOT OF DATA)

The problem is simple i have two DataFrame :

  • one with 90 000 apartment and their latitude/longitude

  • and one with 3 000 pharmacy and their latitude/longitude

And i want to create a new variable for all my apartments : 'distance of the nearest pharmacy'

For this i tried two methods which spend to much time:

First method : I created a matrix with my apartments in row and my pharmacies in columns and the distance betwenn them in the intersection, after that i just take the min of the matrix to have a column vector of 90 000 value

I just use a double for with numpy :

m,n=len(result['latitude']),len(pharma['lat'])
M = np.ones((m,n))
for i in range(m):
     for j in range(n):
        if (result['Code departement'][i]==pharma['departement'][j]):
            M[i,j] =(pharma['lat'][j]-result['latitude'][i])**2+(pharma['lng'][j]-result['longitude'] [i])**2

ps : i know that the wrong formula for lat/long but the appartments are in the same region so it's a good aproximation

Second method : I use the solution from this topics (who are the same problem but with less data) https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe

I used geopandas et the nearest method :

from shapely.ops import nearest_points
pts3 = pharma.geometry.unary_union


def near(point, pts=pts3):
     nearest = pharma.geometry == nearest_points(point, pts)[1]
     return pharma[nearest].geometry.get_values()[0]

appart['Nearest'] = appart.apply(lambda row: near(row.geometry), axis=1)

And as i said, both methods spend too much time, after 1 hour of running my pc/notebook crashed and it failed.

My final question : have you a optimized method to go faster ? it is possible ? If it's already optimized, i will buy a other PC but which criteria but what criteria to look for to have a PC capable of doing such a quick calculation ?

like image 431
Arnaud H Avatar asked Nov 16 '19 18:11

Arnaud H


1 Answers

I guess a Ball Tree is an appropriate structure for this task.

You can use the scikit-learn implementation, see the code below for an example adapted to your case :

import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from sklearn.neighbors import BallTree

## Create the two GeoDataFrame to replicate your dataset
appart = gpd.GeoDataFrame({
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(100000), np.random.rand(100000))
])

pharma = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(3000), np.random.rand(3000))
])

# Create a BallTree 
tree = BallTree(pharma[['x', 'y']].values, leaf_size=2)

# Query the BallTree on each feature from 'appart' to find the distance
# to the nearest 'pharma' and its id
appart['distance_nearest'], appart['id_nearest'] = tree.query(
    appart[['x', 'y']].values, # The input array for the query
    k=1, # The number of nearest neighbors
)

With this method you can solve your problem pretty quickly (the above example, on my computer, took less than a second to find the index of the nearest point, out of 3000 points, on an input dataset of 100000 points).

By default the query method of BallTree is returning the distance to the nearest neighbor and its id. If you want you can disable returning the distance of this nearest neighbor by setting the return_distance parameter to False. If you really only care about the distance, you can save only this value :

appart['distance_nearest'], _ = tree.query(appart[['x', 'y']].values, k=1)
like image 75
mgc Avatar answered Nov 14 '22 22:11

mgc