Cluster labels comparison - label match

Question

I am comparing different clustering methods, for example Agglomerative Clustering against K-means, predictions from a sample, etc.

I am working in Python, mostly using pandas and sklearn.

The issue I have, of course, is that the cluster numbers assigned to the observations are different for every algorithm, and I get something similar to this:

[image: clustering comparison 1]

[image: expected clustering comparison 2]

I am doing it manually for 8 clusters, but with more clusters it would be a nightmare.

I think the idea is to relabel the results based on how many observations the clusters have in common. At the moment I am only comparing results with the same number of clusters, which should be easier.

Thanks!

Pallie · Accepted Answer

Build a contingency matrix from the output of both models. If you want a similarity-type score, use the adjusted Rand index.
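A minimal sketch of this suggestion, extended with one possible way to do the relabelling the question asks for. The label arrays and the `match_labels` helper are hypothetical and not part of the answer. The matching step uses the Hungarian algorithm from SciPy (`linear_sum_assignment`), which is a technique swapped in here for illustration, and it assumes both clusterings have the same number of clusters.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# Hypothetical cluster labels from two algorithms on the same 10 observations.
labels_a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
labels_b = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1, 1])

# Rows follow the sorted unique values of labels_a, columns those of labels_b.
cm = contingency_matrix(labels_a, labels_b)


def match_labels(reference, other):
    """Rename the clusters in `other` to maximise overlap with `reference`.

    Assumes both labelings contain the same number of distinct clusters.
    """
    ref_values = np.unique(reference)
    other_values = np.unique(other)
    overlap = contingency_matrix(reference, other)
    # One-to-one pairing of clusters that maximises the total shared count.
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = {other_values[c]: ref_values[r] for r, c in zip(rows, cols)}
    return np.array([mapping[label] for label in other])


matched = match_labels(labels_a, labels_b)
agreement = (matched == labels_a).mean()

# The adjusted Rand index ignores label names, so relabelling does not change it.
ari_before = adjusted_rand_score(labels_a, labels_b)
ari_after = adjusted_rand_score(labels_a, matched)

print(cm)
print(matched, agreement)
```

After matching, the two label columns can be compared directly, while the adjusted Rand index gives a single score that does not depend on how the clusters are numbered.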

Elliott de Launay · Answer

A contingency matrix worked for my use case, where K=6 and my label was binary:

from sklearn.metrics.cluster import contingency_matrix

contingency_matrix(y_val_tr, clustering.labels_)

Outputs something like:

array([[ 8, 15,  7,  0, 19,  9],
       [ 1,  0, 13, 16,  0,  0]])

Here the first row holds, for each of the six clusters, the number of samples whose true label is 0, and the second row the number whose true label is 1. For my use case I went column by column and took whichever row had the max value to relabel each cluster and evaluate KMeans performance.
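The column-by-column step can be sketched as an argmax over each column of the matrix shown above. This is only an illustration: the names `cluster_to_label`, `purity`, and `cluster_ids` are hypothetical, and it assumes the true labels are 0 and 1 and the cluster ids are 0 to 5, matching the row and column order of the matrix.

```python
import numpy as np

# The contingency matrix from the output above:
# rows are the two true labels, columns are the six clusters.
cm = np.array([[8, 15, 7, 0, 19, 9],
               [1, 0, 13, 16, 0, 0]])

# For each cluster (column), pick the true label (row) with the largest count.
cluster_to_label = cm.argmax(axis=0)

# Share of samples that land in a cluster whose majority label is their own.
purity = cm.max(axis=0).sum() / cm.sum()

# Hypothetical cluster assignments, translated to majority labels by indexing.
cluster_ids = np.array([0, 2, 3, 5, 4])
predicted_labels = cluster_to_label[cluster_ids]

print(cluster_to_label, purity, predicted_labels)
```

Unlike a one-to-one matching, this majority vote lets several clusters map to the same label, which is what is needed when there are more clusters than classes.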

Tags: python, pandas, cluster-analysis, scikit-learn

Johnny
