Calculate Mahalanobis distance using NumPy only

Tags:

I am looking for NumPy way of calculating Mahalanobis distance between two numpy arrays (x and y). The following code can correctly calculate the same using cdist function of Scipy. Since this function calculates unnecessary matix in my case, I want more straight way of calculating it using NumPy only.

import numpy as np
from scipy.spatial.distance import cdist

x = np.array([[[1,2,3,4,5],
               [5,6,7,8,5],
               [5,6,7,8,5]],
              [[11,22,23,24,5],
               [25,26,27,28,5],
               [5,6,7,8,5]]])
i,j,k = x.shape

xx = x.reshape(i,j*k).T


y = np.array([[[31,32,33,34,5],
               [35,36,37,38,5],
               [5,6,7,8,5]],
              [[41,42,43,44,5],
               [45,46,47,48,5],
               [5,6,7,8,5]]])


yy = y.reshape(i,j*k).T

results =  cdist(xx,yy,'mahalanobis')
results = np.diag(results)
print results



[ 2.28765854  2.75165028  2.75165028  2.75165028  0.          2.75165028
  2.75165028  2.75165028  2.75165028  0.          0.          0.          0.
  0.          0.        ]

My trial:

VI = np.linalg.inv(np.cov(xx,yy))

print np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T))

Could anybody correct this method?

Here is formula for it:

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.mahalanobis.html#scipy.spatial.distance.mahalanobis

702

asked Dec 29 '14 09:12

Borys

2 Answers

I think your problem lies in the construction of your covariance matrix. Try:

X = np.vstack([xx,yy])
V = np.cov(X.T)
VI = np.linalg.inv(V)
print np.diag(np.sqrt(np.dot(np.dot((xx-yy),VI),(xx-yy).T)))

Output:

[ 2.28765854  2.75165028  2.75165028  2.75165028  0.          2.75165028
  2.75165028  2.75165028  2.75165028  0.          0.          0.          0.
  0.          0.        ]

To do this without the intermediate array implicitly created here, you might have to sacrifice a C loop for a Python one:

A = np.dot((xx-yy),VI)
B = (xx-yy).T
n = A.shape[0]
D = np.empty(n)
for i in range(n):
    D[i] = np.sqrt(np.sum(A[i] * B[:,i]))

EDIT: actually, with np.einsum voodoo you can remove the Python loop and speed it up a lot (on my system, from 84.3 µs to 2.9 µs):

D = np.sqrt(np.einsum('ij,ji->i', A, B))

EDIT: As @Warren Weckesser points out, einsum can be used to do away with the intermediate A and B arrays too:

delta = xx - yy
D = np.sqrt(np.einsum('nj,jk,nk->n', delta, VI, delta))

192

answered Sep 23 '22 23:09

xnx

Another simple solution which is just as fast as the einsum

e = xx-yy
X = np.vstack([xx,yy])
V = np.cov(X.T) 
p = np.linalg.inv(V)
D = np.sqrt(np.sum(np.dot(e,p) * e, axis = 1))

answered Sep 25 '22 23:09

David

Related questions
                            
                                How do I upgrade a dependency in a Python project on Heroku
                            
                                Parsing gettext `.po` files with python
                            
                                Argparse - Custom Action With No Argument?
                            
                                Adobe Photoshop-style posterization and OpenCV
                            
                                Modify INI file with Python
                            
                                Reading excel files with xlrd
                            
                                Define a python dictionary with immutable keys but mutable values
                            
                                pymongo default database connection
                            
                                How to set value of a ManyToMany field in Django?
                            
                                Matplotlib colorbar background and label placement
                            
                                How to set settings.LOGIN_URL to a view function name in Django 1.5+
                            
                                Django ModelChoiceField has no plus button
                            
                                Matplotlib imshow/matshow display values on plot
                            
                                Mayavi points3d with different size and colors
                            
                                Python mysql (using pymysql) auto reconnect
                            
                                PyDrive: cannot write file to specific GDrive folder
                            
                                Dot-boxplots from DataFrames
                            
                                run selenium with crontab (python)
                            
                                Indexes of elements in NumPy array that satisfy conditions on the value and the index
                            
                                Why is pandas.Series.std() different from numpy.std()?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Calculate Mahalanobis distance using NumPy only

Tags:

python

numpy

Borys

People also ask

2 Answers

xnx

David

Recent Activity

Donate For Us