Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Search numpy array ((x, y, z)...) for z matching nearest x, y

I have a very large array similar to elevation data of the format:

triplets = ((x0, y0, z0), 
            (x1, y1, z1), 
            ... ,
            (xn, yn, zn))

where x, y, z are all floats in metres. You can create suitable test data matching this format with:

x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0
triplets = np.hstack((x, y, z)).reshape((len(x),3))

I want to be able to efficiently find the corresponding z-value for a given (x,y) pair. My research so far leads to more questions. Here's what I've got:

  1. Iterate through all of the triplets:

    query = (a, b) # where a, b are the x and y coordinates we're looking for
    for i in triplets:
      if i[0] == query[0] and i[1] == query[1]:
        result = i[2]
    

    Drawbacks: slow; a, b must exist, which is a problem with comparing floats.

  2. Use scipy.spatial.cKDTree to find nearest:

    points = triplets[:,0:2] # drops the z column
    tree = cKDTree(points)
    idx = tree.query((a, b))[1] # this returns a tuple, we want the index
    query = tree.data[idx]
    result = triplets[idx, 2]
    

    Drawbacks: returns nearest point rather than interpolating.

  3. Using interp2d as per comment:

    f = interp2d(x, y, z)
    result = f(a, b)
    

    Drawbacks: doesn't work on a large dataset. I get OverflowError: Too many data points to interpolate when run on real data. (My real data is some 11 million points.)

So the question is: is there any straightforward way of doing this that I'm overlooking? Are there ways to reduce the drawbacks of the above?

like image 975
troy.unrau Avatar asked May 20 '14 20:05

troy.unrau


People also ask

How do I compare values in two NumPy arrays?

Method 1: We generally use the == operator to compare two NumPy arrays to generate a new array object. Call ndarray. all() with the new array object as ndarray to return True if the two NumPy arrays are equivalent.

What is c_ in NumPy?

c_ = <numpy.lib.index_tricks.CClass object> Translates slice objects to concatenation along the second axis. This is short-hand for np. r_['-1,2,0', index expression] , which is useful because of its common occurrence.


2 Answers

If you want to interpolate the result, rather than just find the z value for the nearest neighbour, I would consider doing something like the following:

  1. Use a k-d tree to partition your data points according to their (x, y) coordinates
  2. For a given (xi, yi) point to interpolate, find its k nearest neighbours
  3. Take the average of their z values, weighted according to their distance from (xi, yi)

The code might look something like this:

import numpy as np
from scipy.spatial import cKDTree

# some fake (x, y, z) data
XY = np.random.rand(10000, 2) - 0.5
Z = np.exp(-((XY ** 2).sum(1) / 0.1) ** 2)

# construct a k-d tree from the (x, y) coordinates
tree = cKDTree(XY)

# a random point to query
xy = np.random.rand(2) - 0.5

# find the k nearest neighbours (say, k=3)
distances, indices = tree.query(xy, k=3)

# the z-values for the k nearest neighbours of xy
z_vals = Z[indices]

# take the average of these z-values, weighted by 1 / distance from xy
dw_avg = np.average(z_vals, weights=(1. / distances))

It's worth playing around a bit with the value of k, the number of nearest neighbours to take the average of. This is essentially a crude form of kernel density estimation, where the value of k controls the degree of 'smoothness' you're imposing on the underlying distribution of z-values. A larger k results in more smoothness.

Similarly, you might want to play around with how you weight the contributions of points according to their distance from (xi, yi), depending on how you think similarity in z decreases with increasing x, y distance. For example you might want to weight by (1 / distances ** 2) rather than (1 / distances).

In terms of performance, constructing and searching k-d trees are both very efficient. Bear in mind that you only need to construct the tree once for your dataset, and if necessary you can query multiple points at a time by passing (N, 2) arrays to tree.query().

Tools for approximate nearest neighbour searches, such as FLANN, might potentially be quicker, but these are usually more helpful in situations when the dimensionality of your data is very high.

like image 106
ali_m Avatar answered Sep 18 '22 02:09

ali_m


I don't understand your cKDTree code, you got the idx, why do the for loop again? You can get the result just by result = triplets[idx, 2].

from scipy.spatial import cKDTree

x = np.arange(20, 40, dtype=np.float64)
y = np.arange(30, 50, dtype=np.float64)
z = np.random.random(20) * 25.0
triplets = np.hstack((x, y, z)).reshape((len(x),3))

a = 30.1
b = 40.5

points = triplets[:,0:2] # drops the z column
tree = cKDTree(points)
idx = tree.query((a, b))[1] # this returns a tuple, we want the index
result = triplets[idx, 2]
like image 24
HYRY Avatar answered Sep 17 '22 02:09

HYRY