
How can you compare two cluster groupings in terms of similarity or overlap in Python?

Simplified example of what I'm trying to do:

Let's say I have 3 data points A, B, and C. I run KMeans clustering on this data and get 2 clusters [(A,B),(C)]. Then I run MeanShift clustering on this data and get 2 clusters [(A),(B,C)]. So clearly the two clustering methods have clustered the data in different ways. I want to be able to quantify this difference. In other words, what metric can I use to determine percent similarity/overlap between the two cluster groupings obtained from the two algorithms? Here is a range of scores that might be given:

  • 100% score for [(A,B),(C)] vs. [(A,B),(C)]
  • ~50% score for [(A,B),(C)] vs. [(A),(B,C)]
  • ~20% score for [(A,B),(C)] vs. [(A,B,C)]

These scores are a bit arbitrary because I'm not sure how to measure similarity between two different cluster groupings. Keep in mind that this is a simplified example, and in real applications you can have many data points and also more than 2 clusters per cluster grouping. Having such a metric is also useful when trying to compare a cluster grouping to a labeled grouping of data (when you have labeled data).

Edit: One idea that I have is to take every cluster in the first cluster grouping and get its percent overlap with every cluster in the second cluster grouping. This would give you a similarity matrix of clusters in the first cluster grouping against clusters in the second cluster grouping. But then I'm not sure what you would do with this matrix. Maybe take the highest similarity score in each row or column and do something with that?
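The matrix idea from the edit can be sketched in a few lines. This is only an illustration: it assumes "percent overlap" means the Jaccard index (intersection size divided by union size), and the names `overlap_matrix`, `kmeans_clusters`, and `meanshift_clusters` are hypothetical, not from any library.

```python
def overlap_matrix(clusters_a, clusters_b):
    """Jaccard overlap of every cluster in clusters_a with every cluster in clusters_b.

    Assumption: "percent overlap" is taken to mean |a & b| / |a | b|.
    Rows follow clusters_a, columns follow clusters_b.
    """
    matrix = []
    for a in clusters_a:
        a = set(a)
        row = []
        for b in clusters_b:
            b = set(b)
            union = a | b
            # Two empty clusters have no members to compare; score them 0.
            row.append(len(a & b) / len(union) if union else 0.0)
        matrix.append(row)
    return matrix


# The toy groupings from the question.
kmeans_clusters = [{"A", "B"}, {"C"}]
meanshift_clusters = [{"A"}, {"B", "C"}]

matrix = overlap_matrix(kmeans_clusters, meanshift_clusters)

# One possible way to collapse the matrix: best match per row, then average.
best_per_row = [max(row) for row in matrix]
average_best = sum(best_per_row) / len(best_per_row)
```

Taking the row maximum is just one choice. It ignores cluster sizes and is not symmetric in general, which is part of why chance-corrected pair-counting metrics are usually preferred.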

DataMan asked Jul 13 '17 14:07



1 Answer

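Use external cluster evaluation metrics. They score the agreement between two labelings of the same data points, so they work both for comparing two clusterings and for comparing a clustering to labeled data.

Many of these metrics are symmetric, so it does not matter which grouping you treat as the reference. The adjusted Rand index (ARI) is one example.

An ARI close to 1 means the two groupings are very similar, a value close to 0 means they agree no more than random assignments would, and a value much less than 0 means each cluster of one grouping is "evenly" distributed over all clusters of the other.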
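In practice you would probably call `sklearn.metrics.adjusted_rand_score(labels_a, labels_b)` from scikit-learn. To show what the index computes, here is a minimal standard-library sketch of the usual pair-counting formula. The function name `adjusted_rand_index` and the label encoding (one cluster id per data point, in the same point order for both groupings) are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same points.

    labels_a[i] and labels_b[i] are the cluster ids of point i in each grouping.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")

    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two points: nothing to disagree on

    # Pairs of points placed together in both groupings, in A only, in B only.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = pairs_a * pairs_b / total_pairs  # agreement expected by chance
    maximum = (pairs_a + pairs_b) / 2           # agreement of a perfect match
    if maximum == expected:
        return 1.0  # both groupings are trivial (one cluster, or all singletons)
    return (pairs_both - expected) / (maximum - expected)


# Points in the order A, B, C, using the groupings from the question.
same = adjusted_rand_index([0, 0, 1], [0, 0, 1])      # [(A,B),(C)] vs [(A,B),(C)]
shifted = adjusted_rand_index([0, 0, 1], [0, 1, 1])   # [(A,B),(C)] vs [(A),(B,C)]
merged = adjusted_rand_index([0, 0, 1], [0, 0, 0])    # [(A,B),(C)] vs [(A,B,C)]
```

On this three-point toy example the values come out as 1.0, about -0.5, and 0.0. With so few points the chance correction dominates, so the index only becomes informative on realistically sized data.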

Has QUIT--Anony-Mousse answered Sep 17 '22 15:09
