Finding index of nearest point in numpy arrays of x and y coordinates

I have two 2D numpy arrays: x_array contains positional information in the x-direction, and y_array contains positions in the y-direction.

I then have a long list of x,y points.

For each point in the list, I need to find the array index of the location (specified in the arrays) which is closest to that point.

I have naively produced some code which works, based on this question: Find nearest value in numpy array

i.e.

import time
import numpy

def find_index_of_nearest_xy(y_array, x_array, y_point, x_point):
    distance = (y_array - y_point)**2 + (x_array - x_point)**2
    idy, idx = numpy.where(distance == distance.min())
    return idy[0], idx[0]

def do_all(y_array, x_array, points):
    store = []
    for i in xrange(points.shape[1]):
        store.append(find_index_of_nearest_xy(y_array, x_array, points[0, i], points[1, i]))
    return store

# Create some dummy data
y_array = numpy.random.random(10000).reshape(100, 100)
x_array = numpy.random.random(10000).reshape(100, 100)

points = numpy.random.random(10000).reshape(2, 5000)

# Time how long it takes to run
start = time.time()
results = do_all(y_array, x_array, points)
end = time.time()
print 'Completed in: ', end - start

I'm doing this over a large dataset and would really like to speed it up a bit. Can anyone optimize this?

Thanks.


UPDATE: SOLUTION following suggestions by @silvado and @justin (below)

import scipy.spatial

# Shoe-horn existing data for entry into KDTree routines
combined_x_y_arrays = numpy.dstack([y_array.ravel(), x_array.ravel()])[0]
points_list = list(points.transpose())

def do_kdtree(combined_x_y_arrays, points):
    mytree = scipy.spatial.cKDTree(combined_x_y_arrays)
    dist, indexes = mytree.query(points)
    return indexes

start = time.time()
results2 = do_kdtree(combined_x_y_arrays, points_list)
end = time.time()
print 'Completed in: ', end - start

The code above sped up my original code (searching for 5000 points in 100x100 matrices) by a factor of about 100. Interestingly, using scipy.spatial.KDTree (instead of scipy.spatial.cKDTree) gave timings comparable to my naive solution, so it is definitely worth using the cKDTree version.
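One practical detail: because the x and y arrays are raveled before building the tree, cKDTree.query returns indices into the flattened 100x100 grid rather than (row, column) pairs. If the 2-D indices from the original approach are needed, numpy.unravel_index recovers them. A minimal sketch, assuming results2 and y_array from the code above:

# Convert the flat indices returned by the KD-tree query back to 2-D indices.
# Assumes results2 (flat indices) and y_array (the 100x100 grid) from above.
rows, cols = numpy.unravel_index(results2, y_array.shape)
# (rows[i], cols[i]) is the grid cell nearest to the i-th query point,
# matching the (idy, idx) pairs from the naive find_index_of_nearest_xy.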

Pete W, asked May 30 '12




2 Answers

Here is a scipy.spatial.KDTree example:

In [1]: from scipy import spatial

In [2]: import numpy as np

In [3]: A = np.random.random((10,2))*100

In [4]: A
Out[4]:
array([[ 68.83402637,  38.07632221],
       [ 76.84704074,  24.9395109 ],
       [ 16.26715795,  98.52763827],
       [ 70.99411985,  67.31740151],
       [ 71.72452181,  24.13516764],
       [ 17.22707611,  20.65425362],
       [ 43.85122458,  21.50624882],
       [ 76.71987125,  44.95031274],
       [ 63.77341073,  78.87417774],
       [  8.45828909,  30.18426696]])

In [5]: pt = [6, 30]  # <-- the point to find

In [6]: A[spatial.KDTree(A).query(pt)[1]]  # <-- the nearest point
Out[6]: array([  8.45828909,  30.18426696])

# how it works!
In [7]: distance, index = spatial.KDTree(A).query(pt)

In [8]: distance  # <-- The distance to the nearest neighbor
Out[8]: 2.4651855048258393

In [9]: index  # <-- The location of the neighbor
Out[9]: 9

# then
In [10]: A[index]
Out[10]: array([  8.45828909,  30.18426696])
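As a small extension of the example above (a sketch, reusing the same A), query also accepts an array of query points and a k argument, so several points can be looked up, each with its k nearest neighbours, in one call:

# Query two points at once, asking for the 2 nearest neighbours of each.
pts = np.array([[6, 30], [60, 60]])
distances, indices = spatial.KDTree(A).query(pts, k=2)
# distances and indices both have shape (2, 2): one row per query point,
# one column per neighbour; A[indices[0, 0]] is the closest point to pts[0].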
efirvida, answered Sep 29 '22


scipy.spatial also has a k-d tree implementation: scipy.spatial.KDTree.

The approach is generally to first use the point data to build up a k-d tree. The computational complexity of that is on the order of N log N, where N is the number of data points. Range queries and nearest neighbour searches can then be done with log N complexity. This is much more efficient than simply cycling through all points (complexity N).

Thus, if you have repeated range or nearest neighbor queries, a k-d tree is highly recommended.
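A minimal sketch of that build-once, query-many pattern (my own illustration with made-up data, using scipy.spatial.cKDTree as in the question's update):

import numpy as np
from scipy import spatial

# Build the k-d tree once: roughly O(N log N) for N data points
data = np.random.random((10000, 2))       # N data points in 2-D
tree = spatial.cKDTree(data)

# Each nearest-neighbour lookup is then roughly O(log N)
queries = np.random.random((5000, 2))     # many query points
distances, indices = tree.query(queries)  # nearest data point for each query
# indices[i] is the row in data closest to queries[i]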

silvado, answered Sep 28 '22