Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the nearest neighbour index from one series to another

I have a target array A, which represents isobaric pressure levels in NCEP reanalysis data. I also have the pressure at which a cloud is observed as a long time series, B.

What I am looking for is a k-nearest neighbour lookup that returns the indices of those nearest neighbours, something like knnsearch in Matlab that could be represented the same in python such as: indices, distance = knnsearch(A, B, n) where indices is the nearest n indices in A for every value in B, and distance is how far removed the value in B is from the nearest value in A, and A and B can be of different lengths (this is the bottleneck that I have found with most solutions so far, whereby I would have to loop each value in B to return my indices and distance)

import numpy as np

A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10]) # this is a fixed 17-by-1 array
B = np.array([923, 584.2, 605.3, 153.2]) # this can be any n-by-1 array
n = 2

What I would like returned from indices, distance = knnsearch(A, B, n) is this:

indices = [[1, 2],[4, 5] etc...] 

where 923 in A is matched to first A[1]=925 and then A[2]=850 and 584.2 in A is matched to first A[4]=600 and then A[5]=500

distance = [[72, 77],[15.8, 84.2] etc...]

where 72 represents the distance between queried value in B to the nearest value in A e.g. distance[0, 0] == np.abs(B[0] - A[1])

The only solution I have been able to come up with is:

import numpy as np


def knnsearch(A, B, n):
    indices = np.zeros((len(B), n))
    distances = np.zeros((len(B), n))

    for i in range(len(B)):
        a = A
        for N in range(n):
            dif = np.abs(a - B[i])
            ind = np.argmin(dif)

            indices[i, N] = ind + N
            distances[i, N] = dif[ind + N]
            # remove this neighbour from from future consideration
            np.delete(a, ind)

    return indices, distances


array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
neighbours = 2

indices, distances = knnsearch(array_A, array_B, neighbours)

print(indices)
print(distances)

returns:

[[ 1.  2.]
 [ 4.  5.]
 [ 4.  3.]
 [10. 11.]]

[[  2.   73. ]
 [ 15.8  84.2]
 [  5.3  94.7]
 [  3.2  53.2]]

There must be a way to remove the for loops, as I need the performance should my A and B arrays contain many thousands of elements with many nearest neighbours...

Please help! Thanks :)

like image 815
JBright Avatar asked Sep 17 '25 19:09

JBright


1 Answers

The second loop can easily be vectorized. The most straightforward way to do it is to use np.argsort and select the indices corresponding to the n smallest dif values. However, for large arrays, as only n values should be sorted, it is better to use np.argpartition.

Therefore, the code would look like something like that:

def vector_knnsearch(A, B, n):
    indices = np.empty((len(B), n))
    distances = np.empty((len(B), n))

    for i,b in enumerate(B):
        dif = np.abs(A - b)
        min_ind = np.argpartition(dif,n)[:n] # Returns the indexes of the 3 smallest
                                             # numbers but not necessarily sorted
        ind = min_ind[np.argsort(dif[min_ind])] # sort output of argpartition just in case
        indices[i, :] = ind
        distances[i, :] = dif[ind]

    return indices, distances

As said in the comments, the first loop can also be removed using a meshgrid, however, the extra use of memory and computation time to construct the meshgrid makes this approach slower for the dimensions I tried (and this will probably get worse for large arrays and end up in Memory Error). In addition, the readability of the code decreases. Overall, this would probably do this approach less pythonic.

def mesh_knnsearch(A, B, n):
    m = len(B)
    rng = np.arange(m).reshape((m,1))
    Amesh, Bmesh = np.meshgrid(A,B)
    dif = np.abs(Amesh-Bmesh)
    min_ind = np.argpartition(dif,n,axis=1)[:,:n]
    ind = min_ind[rng,np.argsort(dif[rng,min_ind],axis=1)]

    return ind, dif[rng,ind]

Not that it is important to define this rng as a 2d array in order to retrieve a[rng[0],ind[0]], a[rng[1],ind[1]], etc and maintain the dimensions of the array, as opposed to a[:,ind] which retrieves a[:,ind[0]], a[:,ind[1]], etc.

like image 65
OriolAbril Avatar answered Sep 20 '25 09:09

OriolAbril