
Is there a Python equivalent to the mahalanobis() function in R? If not, how can I implement it?

I have the following R code that calculates the Mahalanobis distance on the Iris dataset and returns a numeric vector with 150 values, one for each observation in the dataset.

x=read.csv("Iris Data.csv")
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)  

I tried to implement the same in Python using the scipy.spatial.distance.mahalanobis(u, v, VI) function, but it seems to accept only one-dimensional arrays as parameters.

jose14 asked Apr 23 '15 07:04



1 Answer

I used the Iris dataset that ships with R; I assume it is the same one you are using.

First, this is my R benchmark, for comparison:

x <- read.csv("IrisData.csv")
x <- x[,c(2,3,4,5)]
mean<-colMeans(x)
Sx<-cov(x)
D2<-mahalanobis(x,mean,Sx)  

Then, in Python, you can use:

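import pandas as pd
from scipy.linalg import inv
from scipy.spatial.distance import mahalanobis

x = pd.read_csv('IrisData.csv')
# Keep the four measurement columns, like x[,c(2,3,4,5)] in R
# (.ix has been removed from pandas; use .iloc for positional indexing)
x = x.iloc[:, 1:5]

# scipy's mahalanobis() expects the inverse of the covariance matrix
Sx = x.cov().values
Sx = inv(Sx)

mean = x.mean().values

def mahalanobisR(X, meanCol, IC):
    # Squared distance of every row of X from meanCol, as R's mahalanobis() returns
    m = []
    for i in range(X.shape[0]):
        m.append(mahalanobis(X.iloc[i, :], meanCol, IC) ** 2)
    return m

mR = mahalanobisR(x, mean, Sx)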

I defined a function so you can reuse it on other data sets (note that X is expected to be a pandas DataFrame).
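For larger data sets, the row-by-row loop can be replaced by a vectorized computation. The following is only a minimal sketch, assuming NumPy is available; the function name mahalanobis_sq and the random stand-in data are hypothetical and not part of the code above. Like R's mahalanobis(), it takes the covariance matrix itself rather than its inverse.

```python
import numpy as np

def mahalanobis_sq(X, center, cov):
    """Squared Mahalanobis distance of each row of X from center."""
    X = np.asarray(X, dtype=float)
    diff = X - np.asarray(center, dtype=float)
    # Solve cov @ z = diff.T rather than forming the inverse explicitly
    z = np.linalg.solve(cov, diff.T).T
    # Row-wise dot product: diff_i . z_i for every observation i
    return np.einsum('ij,ij->i', diff, z)

# Hypothetical stand-in data with the same shape as the Iris measurements
rng = np.random.default_rng(0)
data = rng.normal(size=(150, 4))
d2 = mahalanobis_sq(data, data.mean(axis=0), np.cov(data, rowvar=False))
print(d2.shape)
```

With a pandas DataFrame, something like mahalanobis_sq(x.values, x.mean().values, x.cov().values) should give the same numbers as the loop, since both pandas and R use the n-1 denominator for the covariance.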

Comparing results:

In R

> D2[c(1,2,3,4,5)]

[1] 2.134468 2.849119 2.081339 2.452382 2.462155

In Python:

In [43]: mR[0:5]
Out[45]: 
[2.1344679233248431,
 2.8491186861585733,
 2.0813386639577991,
 2.4523816316796712,
 2.4621545347140477]

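Just be careful: R's mahalanobis() returns the squared Mahalanobis distance, whereas scipy's mahalanobis() returns the unsquared distance. That is why the Python function squares each result.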
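The relationship between the two conventions can be checked on a small example. This is a sketch under assumptions: it uses scipy.spatial.distance.cdist with metric='mahalanobis' and the VI keyword, and the random data is made up purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist, mahalanobis

rng = np.random.default_rng(1)
data = rng.normal(size=(10, 3))                  # hypothetical sample data
center = data.mean(axis=0)
VI = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix

# cdist returns unsquared distances; one column because there is one center
d = cdist(data, center[None, :], metric='mahalanobis', VI=VI).ravel()

# Same value as scipy's one-vector-at-a-time function
assert np.isclose(d[0], mahalanobis(data[0], center, VI))

# Squaring gives the quantity that R's mahalanobis() reports
d2 = d ** 2
```

Taking np.sqrt of the R output, or squaring the SciPy output, puts both on the same scale.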

Cristián Antuña answered Oct 31 '22 04:10