I have a target array A, which represents isobaric pressure levels in NCEP reanalysis data. I also have the pressure at which a cloud is observed as a long time series, B.
What I am looking for is a k-nearest neighbour lookup that returns the indices of those nearest neighbours, something like knnsearch
in Matlab that could be represented the same in python such as: indices, distance = knnsearch(A, B, n)
where indices
is the nearest n
indices in A
for every value in B
, and distance
is how far removed the value in B
is from the nearest value in A
, and A
and B
can be of different lengths (this is the bottleneck that I have found with most solutions so far, whereby I would have to loop each value in B
to return my indices
and distance
)
import numpy as np
A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10]) # this is a fixed 17-by-1 array
B = np.array([923, 584.2, 605.3, 153.2]) # this can be any n-by-1 array
n = 2
What I would like returned from indices, distance = knnsearch(A, B, n)
is this:
indices = [[1, 2],[4, 5] etc...]
where 923
in A
is matched to first A[1]=925
and then A[2]=850
and 584.2
in A
is matched to first A[4]=600
and then A[5]=500
distance = [[72, 77],[15.8, 84.2] etc...]
where 72
represents the distance between queried value in B
to the nearest value in A
e.g. distance[0, 0] == np.abs(B[0] - A[1])
The only solution I have been able to come up with is:
import numpy as np
def knnsearch(A, B, n):
indices = np.zeros((len(B), n))
distances = np.zeros((len(B), n))
for i in range(len(B)):
a = A
for N in range(n):
dif = np.abs(a - B[i])
ind = np.argmin(dif)
indices[i, N] = ind + N
distances[i, N] = dif[ind + N]
# remove this neighbour from from future consideration
np.delete(a, ind)
return indices, distances
array_A = np.array([1000, 925, 850, 700, 600, 500, 400, 300, 250, 200, 150, 100, 70, 50, 30, 20, 10])
array_B = np.array([923, 584.2, 605.3, 153.2])
neighbours = 2
indices, distances = knnsearch(array_A, array_B, neighbours)
print(indices)
print(distances)
returns:
[[ 1. 2.]
[ 4. 5.]
[ 4. 3.]
[10. 11.]]
[[ 2. 73. ]
[ 15.8 84.2]
[ 5.3 94.7]
[ 3.2 53.2]]
There must be a way to remove the for loops, as I need the performance should my A and B arrays contain many thousands of elements with many nearest neighbours...
Please help! Thanks :)
The second loop can easily be vectorized. The most straightforward way to do it is to use np.argsort
and select the indices corresponding to the n smallest dif
values. However, for large arrays, as only n values should be sorted, it is better to use np.argpartition
.
Therefore, the code would look like something like that:
def vector_knnsearch(A, B, n):
indices = np.empty((len(B), n))
distances = np.empty((len(B), n))
for i,b in enumerate(B):
dif = np.abs(A - b)
min_ind = np.argpartition(dif,n)[:n] # Returns the indexes of the 3 smallest
# numbers but not necessarily sorted
ind = min_ind[np.argsort(dif[min_ind])] # sort output of argpartition just in case
indices[i, :] = ind
distances[i, :] = dif[ind]
return indices, distances
As said in the comments, the first loop can also be removed using a meshgrid, however, the extra use of memory and computation time to construct the meshgrid makes this approach slower for the dimensions I tried (and this will probably get worse for large arrays and end up in Memory Error). In addition, the readability of the code decreases. Overall, this would probably do this approach less pythonic.
def mesh_knnsearch(A, B, n):
m = len(B)
rng = np.arange(m).reshape((m,1))
Amesh, Bmesh = np.meshgrid(A,B)
dif = np.abs(Amesh-Bmesh)
min_ind = np.argpartition(dif,n,axis=1)[:,:n]
ind = min_ind[rng,np.argsort(dif[rng,min_ind],axis=1)]
return ind, dif[rng,ind]
Not that it is important to define this rng
as a 2d array in order to retrieve a[rng[0],ind[0]]
, a[rng[1],ind[1]]
, etc and maintain the dimensions of the array, as opposed to a[:,ind]
which retrieves a[:,ind[0]]
, a[:,ind[1]]
, etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With