
Python Clustering 'purity' metric

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.

I could use the function score() to compute the log probability under the model.

However, I am looking for a metric called 'purity' which is defined in this article.

How can I implement it in Python? My current implementation looks like this:

import numpy as np
from sklearn.mixture import GaussianMixture

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2-dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

# GaussianMixture replaces the older GMM class (removed in scikit-learn 0.20).
clusterer = GaussianMixture(n_components=3, covariance_type='diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster.
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I cannot work out how to loop through each cluster in order to compute the confusion matrix (according to this question).

Kuka asked Dec 02 '15 16:12


1 Answer

David's answer works but here is another way to do it.

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix) 
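
Written out, and assuming the usual definition of purity (the article linked in the question is not reproduced here), the function above computes the following, where n_{jk} is the number of samples of true class j assigned to cluster k and N is the total number of samples:

```latex
\mathrm{purity} = \frac{1}{N} \sum_{k} \max_{j} \, n_{jk}
```

contingency_matrix(y_true, y_pred) puts the true classes on the rows and the predicted clusters on the columns, so np.amax(..., axis=0) picks the largest class count within each cluster, and the final division by np.sum(contingency_matrix) is the division by N.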

Also, if you need to compute inverse purity, all you need to do is replace "axis=0" with "axis=1".
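
To see what the two axis choices mean, the same two quantities can be cross-checked by hand with a plain loop over clusters, which is roughly what the question was trying to do. This is only a minimal sketch using the standard library: the helper name purity_by_loop and the toy labels are made up for illustration, and it assumes ground-truth labels are available (the random X in the question has none).

```python
from collections import Counter, defaultdict

def purity_by_loop(groups, labels):
    """Group samples by `groups`, take the most frequent `labels` value
    in each group, and return the fraction of samples that match it."""
    members = defaultdict(Counter)
    for g, l in zip(groups, labels):
        members[g][l] += 1          # count label l inside group g
    matched = sum(max(counts.values()) for counts in members.values())
    return matched / len(groups)    # raises ZeroDivisionError on empty input

# Toy data (hypothetical): 8 samples, 3 true classes, 2 predicted clusters.
y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

# Grouping by cluster corresponds to axis=0 (purity);
# grouping by true class corresponds to axis=1 (inverse purity).
purity = purity_by_loop(y_pred, y_true)
inverse_purity = purity_by_loop(y_true, y_pred)
print(purity, inverse_purity)  # 0.625 0.875
```

Here cluster 0 holds two samples of class 0 and cluster 1 is dominated by the three samples of class 1, so purity is (2 + 3) / 8. Going the other way, the best cluster for each class covers 2, 3 and 2 samples, so inverse purity is 7 / 8.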

Ugurite answered Sep 23 '22 20:09