
How can I build a distance matrix with my own metric without using a loop?

I have a np.array like this:

[[ 1.3 , 2.7 , 0.5 , NaN , NaN],
[ 2.0 , 8.9 , 2.5 , 5.6 , 3.5],
[ 0.6 , 3.4 , 9.5 , 7.4 , NaN]]

And a function to compute the distance between two rows:

def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())

I need all pairwise distances, and I don't want to use a loop. How do I do that?

HrkBrkkl asked Nov 04 '20 08:11


2 Answers

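Leveraging broadcasting (note that this relies on the NaNs being trailing entries, as in the sample, so that the number of columns valid in both rows equals the smaller of the two rows' valid counts) -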

def manhattan_nan(a):
    s = np.nansum(np.abs(a[:,None,:] - a), axis=-1)
    m = ~np.isnan(a)
    k = m.sum(1)
    r = a.shape[1]/np.minimum.outer(k,k)
    out = s*r
    return out
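
If NaNs can appear at arbitrary positions rather than only at the end of a row, the smaller of the two valid counts no longer equals the number of columns valid in both rows. A minimal sketch of a variant that counts jointly valid columns with a matrix product of the validity mask instead; the name manhattan_nan_general is hypothetical and not part of the original answer:

```python
import numpy as np

def manhattan_nan_general(a):
    a = np.asarray(a, dtype=float)
    # Pairwise sums of |a_i - a_j| over columns, skipping NaN differences
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    # m[i, c] is 1 where row i has a valid (non-NaN) value in column c
    m = (~np.isnan(a)).astype(int)
    # both[i, j] = number of columns that are valid in row i AND row j
    both = m @ m.T
    # Same rescaling as nan_manhattan: length / (length - number of NaN diffs)
    return s * a.shape[1] / both

# NaNs in different columns: only the last column is valid in both rows
x = np.array([[1.0, np.nan, 3.0],
              [np.nan, 2.0, 5.0]])
print(manhattan_nan_general(x)[0, 1])  # 6.0, i.e. |3 - 5| * 3 / 1
```

A pair of rows with no jointly valid column gives a division by zero here, just as it would in nan_manhattan, so such pairs need separate handling if they can occur.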

Benchmarking

From OP's comments, the use-case seems to be a tall array. Let's build one for benchmarking by repeating the given sample data:

In [2]: a
Out[2]: 
array([[1.3, 2.7, 0.5, nan, nan],
       [2. , 8.9, 2.5, 5.6, 3.5],
       [0.6, 3.4, 9.5, 7.4, nan]])

In [3]: a = np.repeat(a, 100, axis=0)

# @Dani Mesejo's soln
In [4]: %timeit pdist(a, nan_manhattan)
1.02 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Naive for-loop version
In [18]: n = a.shape[0]

In [19]: %timeit [[nan_manhattan(a[i], a[j]) for i in range(j+1,n)] for j in range(n)]
991 ms ± 45.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# With broadcasting
In [9]: %timeit manhattan_nan(a)
8.43 ms ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
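
Timings only matter if the results agree, so here is a small sanity check, sketched under the assumption that the two functions are exactly as given in this thread. It compares the broadcasting result against a plain double loop over nan_manhattan; a smaller repeat factor than in the benchmark is used only to keep the check fast:

```python
import numpy as np

def nan_manhattan(X, Y):
    # Metric from the question: rescaled Manhattan distance ignoring NaNs
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())

def manhattan_nan(a):
    # Broadcasting version from this answer
    s = np.nansum(np.abs(a[:, None, :] - a), axis=-1)
    k = (~np.isnan(a)).sum(1)
    return s * a.shape[1] / np.minimum.outer(k, k)

a = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
              [2.0, 8.9, 2.5, 5.6, 3.5],
              [0.6, 3.4, 9.5, 7.4, np.nan]])
a = np.repeat(a, 10, axis=0)  # 30 rows, same NaN layout as the sample
n = a.shape[0]

# Reference: full n x n matrix built one pair at a time
reference = np.array([[nan_manhattan(a[i], a[j]) for j in range(n)]
                      for i in range(n)])

agree = np.allclose(manhattan_nan(a), reference)
print(agree)
```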
Divakar answered Sep 30 '22 09:09


Use pdist with the custom metric, then squareform to expand the condensed result into a full matrix:

import numpy as np
from scipy.spatial.distance import pdist, squareform


def nan_manhattan(X, Y):
    nan_diff = np.absolute(X - Y)
    length = nan_diff.size
    return np.nansum(nan_diff) * length / (length - np.isnan(nan_diff).sum())


arr = np.array([[1.3, 2.7, 0.5, np.nan, np.nan],
                [2.0, 8.9, 2.5, 5.6, 3.5],
                [0.6, 3.4, 9.5, 7.4, np.nan]])

result = squareform(pdist(arr, nan_manhattan))

print(result)

Output

[[ 0.         14.83333333 17.33333333]
 [14.83333333  0.         19.625     ]
 [17.33333333 19.625       0.        ]]
Dani Mesejo answered Sep 30 '22 09:09