I am looking for NumPy way of calculating Mahalanobis distance between two numpy arrays (x and y). The following code can correctly calculate the same using cdist function of Scipy. Since this function calculates unnecessary matix in my case, I want more straight way of calculating it using NumPy only.
import numpy as np
from scipy.spatial.distance import cdist
x = np.array([[[1,2,3,4,5],
[5,6,7,8,5],
[5,6,7,8,5]],
[[11,22,23,24,5],
[25,26,27,28,5],
[5,6,7,8,5]]])
i,j,k = x.shape
xx = x.reshape(i,j*k).T
y = np.array([[[31,32,33,34,5],
[35,36,37,38,5],
[5,6,7,8,5]],
[[41,42,43,44,5],
[45,46,47,48,5],
[5,6,7,8,5]]])
yy = y.reshape(i,j*k).T
results = cdist(xx,yy,'mahalanobis')
results = np.diag(results)
print results
[ 2.28765854 2.75165028 2.75165028 2.75165028 0. 2.75165028
2.75165028 2.75165028 2.75165028 0. 0. 0. 0.
0. 0. ]
My trial:
VI = np.linalg.inv(np.cov(xx,yy))
print np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T))
Could anybody correct this method?
Here is formula for it:
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.mahalanobis.html#scipy.spatial.distance.mahalanobis
First you calculate the covariance matrix, (S in the equation, “covar mat” in the image). Then you find the inverse of S (“inv-covar” in the image). Then you subtract the mean from v: (66, 640, 44) – (68.0, 600.0, 40.0) to get v-m = (-2, 40, 4).
The lower the Mahalanobis Distance, the closer a point is to the set of benchmark points. A Mahalanobis Distance of 1 or lower shows that the point is right among the benchmark points. This is going to be a good one. The higher it gets from there, the further it is from where the benchmark points are.
A distance matrix contains the distances computed pairwise between the vectors of matrix/ matrices. scipy. spatial package provides us distance_matrix() method to compute the distance matrix. Generally matrices are in the form of 2-D array and the vectors of the matrix are matrix rows ( 1-D array).
I think your problem lies in the construction of your covariance matrix. Try:
X = np.vstack([xx,yy])
V = np.cov(X.T)
VI = np.linalg.inv(V)
print np.diag(np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T)))
Output:
[ 2.28765854 2.75165028 2.75165028 2.75165028 0. 2.75165028
2.75165028 2.75165028 2.75165028 0. 0. 0. 0.
0. 0. ]
To do this without the intermediate array implicitly created here, you might have to sacrifice a C loop for a Python one:
A = np.dot((xx-yy),VI)
B = (xx-yy).T
n = A.shape[0]
D = np.empty(n)
for i in range(n):
D[i] = np.sqrt(np.sum(A[i] * B[:,i]))
EDIT: actually, with np.einsum
voodoo you can remove the Python loop and speed it up a lot (on my system, from 84.3 µs to 2.9 µs):
D = np.sqrt(np.einsum('ij,ji->i', A, B))
EDIT: As @Warren Weckesser points out, einsum
can be used to do away with the intermediate A
and B
arrays too:
delta = xx - yy
D = np.sqrt(np.einsum('nj,jk,nk->n', delta, VI, delta))
Another simple solution which is just as fast as the einsum
e = xx-yy
X = np.vstack([xx,yy])
V = np.cov(X.T)
p = np.linalg.inv(V)
D = np.sqrt(np.sum(np.dot(e,p) * e, axis = 1))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With