I have a NumPy array like this:
[[ 1.3 , 2.7 , 0.5 , NaN , NaN],
[ 2.0 , 8.9 , 2.5 , 5.6 , 3.5],
[ 0.6 , 3.4 , 9.5 , 7.4 , NaN]]
And a function to compute the distance between two rows:
def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())
I need all pairwise distances between the rows, and I don't want to use a loop. How do I do that?
Leveraging broadcasting:
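def manhattan_nan(a):
    # Sum of |a[i] - a[j]| over the columns where both rows are non-NaN
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    m = ~np.isnan(a)  # mask of valid (non-NaN) entries
    k = m.sum(1)      # number of valid entries per row
    # min(k[i], k[j]) matches the number of shared valid columns when the
    # NaNs are trailing, as in the sample; for arbitrary NaN positions it
    # can overcount
    r = a.shape[1] / np.minimum.outer(k, k)
    out = s * r
    return out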
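The np.minimum.outer(k, k) term gives the shared non-NaN count only when the NaNs sit at the end of each row, as they do in the sample. If the NaNs can appear anywhere, one possible variant is to count shared valid columns with a matrix product of the mask. This is a minimal sketch under that assumption; manhattan_nan_general is a hypothetical name and not part of the original answer:

```python
import numpy as np

def manhattan_nan_general(a):
    a = np.asarray(a, dtype=float)
    # Pairwise sum of |a[i] - a[j]|, ignoring columns where either row is NaN
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    # shared[i, j] = number of columns that are non-NaN in both row i and row j
    m = (~np.isnan(a)).astype(int)
    shared = m @ m.T
    # Pairs with no shared column yield NaN; silence the 0/0 warning
    with np.errstate(divide="ignore", invalid="ignore"):
        return s * a.shape[1] / shared

arr = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
                [2.0, 8.9, 2.5, 5.6, 3.5],
                [0.6, 3.4, 9.5, 7.4, np.nan]])
dist = manhattan_nan_general(arr)
print(dist)
```

On the sample data this should agree with the min-based version, since there the trailing NaNs make both counts identical.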
From OP's comments, the use case seems to be a tall array. Let's build one for benchmarking by repeating the given sample data:
In [2]: a
Out[2]:
array([[1.3, 2.7, 0.5, nan, nan],
[2. , 8.9, 2.5, 5.6, 3.5],
[0.6, 3.4, 9.5, 7.4, nan]])
In [3]: a = np.repeat(a, 100, axis=0)
# @Dani Mesejo's solution (pdist with a Python callable)
In [4]: %timeit pdist(a, nan_manhattan)
1.02 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Naive for-loop version
In [18]: n = a.shape[0]
In [19]: %timeit [[nan_manhattan(a[i], a[j]) for i in range(j+1,n)] for j in range(n)]
991 ms ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# With broadcasting
In [9]: %timeit manhattan_nan(a)
8.43 ms ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
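Before trusting the speedup, it is worth checking that both approaches return the same matrix. The snippet below is a sanity-check sketch, not part of the original timings; it uses a smaller repeat factor (10, an arbitrary choice) so the pdist reference finishes quickly:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())

def manhattan_nan(a):
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    k = (~np.isnan(a)).sum(1)
    return s * a.shape[1] / np.minimum.outer(k, k)

a = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
              [2.0, 8.9, 2.5, 5.6, 3.5],
              [0.6, 3.4, 9.5, 7.4, np.nan]])
a = np.repeat(a, 10, axis=0)  # 30 rows; smaller than the benchmark array

ref = squareform(pdist(a, nan_manhattan))  # full square matrix from pdist
out = manhattan_nan(a)                     # broadcasting version
agree = np.allclose(out, ref)
print(agree)
```

Note that the broadcasting version materializes an (n, n, d) intermediate array, so its memory use grows quadratically with the number of rows.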
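Use pdist from scipy with the custom metric, and squareform to expand the condensed result into a full matrix: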
import numpy as np
from scipy.spatial.distance import pdist, squareform
def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())
arr = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
                [2.0, 8.9, 2.5, 5.6, 3.5],
                [0.6, 3.4, 9.5, 7.4, np.nan]])
result = squareform(pdist(arr, nan_manhattan))
print(result)
Output
[[ 0. 14.83333333 17.33333333]
[14.83333333 0. 19.625 ]
[17.33333333 19.625 0. ]]
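To see where an entry such as 14.83333333 comes from, the first pair of rows can be worked through by hand. The following is an illustrative sketch with hypothetical variable names: the partial sum over the three columns both rows share is rescaled by 5/3 to compensate for the two missing columns.

```python
import numpy as np

x = np.array([1.3, 2.7, 0.5, np.nan, np.nan])
y = np.array([2.0, 8.9, 2.5, 5.6, 3.5])

diff = np.abs(x - y)             # NaN wherever either row is NaN
valid = ~np.isnan(diff)          # columns present in both rows
partial = diff[valid].sum()      # 0.7 + 6.2 + 2.0
scaled = partial * diff.size / valid.sum()  # rescale by 5 / 3
print(partial, scaled)
```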