Python Clustering 'purity' metric

Question

I'm using a Gaussian Mixture Model (GMM) from sklearn.mixture to perform clustering of my data set.

I could use the function score() to compute the log probability under the model.

However, I am looking for a metric called 'purity' which is defined in this article.

How can I implement it in Python? My current implementation looks like this:

from sklearn.mixture import GMM

# X is a 1000 x 2 array (1000 samples of 2 coordinates).
# It is actually a 2 dimensional PCA projection of data
# extracted from the MNIST dataset, but this random array
# is equivalent as far as the code is concerned.
X = np.random.rand(1000, 2)

clusterer = GMM(3, 'diag')
clusterer.fit(X)
cluster_labels = clusterer.predict(X)

# Now I can count the labels for each cluster..
count0 = list(cluster_labels).count(0)
count1 = list(cluster_labels).count(1)
count2 = list(cluster_labels).count(2)

But I can not loop through each cluster in order to compute the confusion matrix (according this question)

Ugurite · Accepted Answer

David's answer works but here is another way to do it.

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # compute contingency matrix (also called confusion matrix)
    contingency_matrix = metrics.cluster.contingency_matrix(y_true, y_pred)
    # return purity
    return np.sum(np.amax(contingency_matrix, axis=0)) / np.sum(contingency_matrix)

Also if you need to compute Inverse Purity, all you need to do is replace "axis=0" by "axis=1".

Python Clustering 'purity' metric

Tags:

python

cluster-analysis

scikit-learn

Kuka

1 Answers

Ugurite

Recent Activity

Donate For Us

Python Clustering 'purity' metric

Tags:

python

cluster-analysis

scikit-learn

Kuka

1 Answers

Ugurite

Related questions

Recent Activity

Donate For Us