 

K-means and Mahalanobis distance

Tags:

r

I'd like to use the Mahalanobis distance in the K-means algorithm, because I have 4 variables that are highly correlated (0.85).

It appears to me that it's better to use the Mahalanobis distance in this case.

The problem is I don't know how to implement it in R, with the K-means algorithm.

I think I need to "fake" it by transforming the data before the clustering step, but I don't know how.

I tried the classical kmeans() with the Euclidean distance on standardized data, but as I said, there is too much correlation.

fit <- kmeans(mydata.standardize, 4)

I also tried to find a distance parameter, but I don't think one exists in the kmeans() function.

The expected result is a way to apply the K-means algorithm with the Mahalanobis distance.

asked Dec 05 '25 00:12 by Ricol


1 Answer

You can rescale the data before running the algorithm, using the Cholesky decomposition of the variance matrix: the Euclidean distance after the transformation equals the Mahalanobis distance before it.

# Sample data: k variables, with the first two made highly correlated
n <- 100
k <- 5
x <- matrix( rnorm(k*n), nrow=n, ncol=k )
x[,1:2] <- x[,1:2] %*% matrix( c(.9,1,1,.9), 2, 2 )
var(x)  # Check the sample variance matrix

# Rescale the data: var(x) = C'C, so y = x %*% solve(C) has unit variance
C <- chol( var(x) )
y <- x %*% solve(C)
var(y)  # The identity matrix (up to rounding)

# K-means with the Euclidean distance on y is
# K-means with the Mahalanobis distance on x
kmeans(y, 4)
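As a quick sanity check, here is a minimal sketch (reusing the x, y and C objects from above; comparing rows 1 and 2 is just an illustration) showing that the Euclidean distance between transformed rows matches what base R's mahalanobis() gives on the original rows:

# Squared Mahalanobis distance between rows 1 and 2 of the original data,
# computed with base R's mahalanobis()
S <- var(x)
d.maha <- sqrt( mahalanobis(x[1, ], center = x[2, ], cov = S) )

# Euclidean distance between the same two rows after the Cholesky rescaling
d.eucl <- sqrt( sum( (y[1, ] - y[2, ])^2 ) )

all.equal( as.numeric(d.maha), d.eucl )  # TRUE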

But this assumes that all the clusters have the same shape and orientation as the data as a whole. If that is not the case, you may want to look at models that explicitly allow for elliptical clusters, e.g., in the mclust package; a minimal sketch follows.
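For instance, assuming the mclust package is installed, Mclust() fits a Gaussian mixture in which each cluster's covariance (its shape and orientation) is estimated from the data:

library(mclust)

# Fit a 4-component Gaussian mixture to the raw (untransformed) data;
# the per-cluster covariance structure is selected by BIC
fit <- Mclust(x, G = 4)
fit$classification  # Cluster labels, analogous to kmeans()$cluster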

answered Dec 06 '25 14:12 by Vincent Zoonekynd


