
Cluster labels comparison - label match

I am comparing different clustering methods, for example Agglomerative Clustering against K-means, predicting from a sample, etc.

I am working in Python, mostly with pandas and sklearn.

The issue, of course, is that the cluster numbers assigned to the observations differ from one algorithm to the next, so I get something similar to this:

[image: clustering comparison 1]

[image: expected clustering comparison 2]

I am matching the labels manually for 8 clusters, but with more clusters it would be a nightmare.

I think the idea is to relabel the results based on how many observations the clusters have in common. At the moment I am only comparing results with the same number of clusters, which should be easier.

Thanks!

Johnny asked Sep 12 '25 14:09


2 Answers

Build a contingency matrix from the output of both models. If you want a similarity-type score, use the adjusted Rand index.
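One possible way to turn that contingency matrix into a relabeling is sketched below. Note that the matching step is an addition, not something stated in the answer: it uses SciPy's `linear_sum_assignment` (Hungarian-style assignment) to find the one-to-one pairing of cluster ids with the largest total overlap. The helper name `match_labels` and the two small label arrays are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix


def match_labels(reference, other):
    """Relabel `other` so its cluster ids line up with `reference` (hypothetical helper)."""
    reference = np.asarray(reference)
    other = np.asarray(other)
    # Rows follow np.unique(reference), columns follow np.unique(other).
    table = contingency_matrix(reference, other)
    # One-to-one pairing of rows and columns that maximizes the shared observations.
    row_idx, col_idx = linear_sum_assignment(table, maximize=True)
    ref_ids = np.unique(reference)
    other_ids = np.unique(other)
    mapping = {other_ids[c]: ref_ids[r] for r, c in zip(row_idx, col_idx)}
    # Ids left unmatched (possible when the cluster counts differ) are kept as-is.
    relabeled = np.array([mapping.get(label, label) for label in other])
    return relabeled, mapping


# Made-up labels from two clusterings of the same nine observations.
labels_a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
labels_b = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

relabeled_b, mapping = match_labels(labels_a, labels_b)

# The adjusted Rand index ignores the label names, so no relabeling is needed for it.
ari = adjusted_rand_score(labels_a, labels_b)
```

With the same number of clusters on both sides, every id gets a partner. If the counts differ, the unmatched ids keep their original values and may collide with reference ids, so they would need separate handling.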

Pallie answered Sep 14 '25 06:09


A contingency matrix worked for my use case, where K=6 and my label was binary:

from sklearn.metrics.cluster import contingency_matrix

# First argument: true labels (rows); second argument: cluster labels (columns)
contingency_matrix(y_val_tr, clustering.labels_)

Outputs something like:

array([[ 8, 15,  7,  0, 19,  9],
       [ 1,  0, 13, 16,  0,  0]])

Here each row corresponds to a true label and each column to a cluster: the first row counts the observations with true label 0 that landed in each cluster, and the second row does the same for true label 1. For my use case I went column by column, took whichever row had the max value as that cluster's label, relabeled the predictions, and evaluated KMeans performance.
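The column-by-column step can be sketched with NumPy as follows. This is only an illustration using the matrix from the output above, and it assumes the true labels are the integers 0 and 1 and the cluster ids are 0 to 5, so that row and column positions equal the label values. The `example_clusters` array is hypothetical.

```python
import numpy as np

# Contingency matrix copied from the output above:
# rows = true labels (0, 1), columns = clusters (0..5).
table = np.array([[8, 15, 7, 0, 19, 9],
                  [1, 0, 13, 16, 0, 0]])

# For each cluster (column), take the true label (row) with the largest count.
cluster_to_label = table.argmax(axis=0)

# Share of observations whose majority-vote label equals the true label
# (often called purity).
purity = table.max(axis=0).sum() / table.sum()

# Hypothetical cluster ids for five observations, mapped to labels by indexing.
example_clusters = np.array([0, 3, 4, 2, 1])
example_predicted = cluster_to_label[example_clusters]
```

Unlike a one-to-one matching, this majority vote lets several clusters share the same label, which is what you want when K is larger than the number of classes.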

Elliott de Launay answered Sep 14 '25 04:09