 

Validating Output From a Clustering Algorithm

Is there an objective way to validate the output of a clustering algorithm?

I'm using scikit-learn's affinity propagation clustering against a dataset composed of objects with many attributes. The difference matrix supplied to the clustering algorithm is composed of the weighted differences of these attributes. I'm looking for a way to objectively validate tweaks to the distance weightings as reflected in the resulting clusters. The dataset is large and has enough attributes that manual examination of small samples is not a reasonable way to verify the produced clusters.
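
For concreteness, a minimal sketch of the setup described above, assuming placeholder data X and per-attribute weights (neither is from the original question). scikit-learn's AffinityPropagation with affinity="precomputed" expects similarities, so the weighted distance matrix is negated before fitting; random_state requires a reasonably recent scikit-learn:

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    rng = np.random.RandomState(0)
    X = rng.rand(200, 8)               # placeholder: 200 objects, 8 attributes
    weights = np.ones(X.shape[1])      # placeholder per-attribute weights to be tuned

    # Weighted difference (distance) matrix, as described in the question.
    diff = np.abs(X[:, None, :] - X[None, :, :])   # pairwise absolute attribute differences
    D = (diff * weights).sum(axis=2)               # weighted distance matrix, shape (n, n)

    # Affinity propagation takes similarities, so pass negated distances.
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(-D)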

Andrew M asked Oct 01 '12

1 Answer

Yes:

Give the clusters to a domain expert, and have them analyze whether the structure the algorithm found is sensible. The question is not so much whether it is new, but whether it is sensible.

... and No:

There is no automatic evaluation available that is fair, in the sense that it takes the objective of unsupervised clustering into account: knowledge discovery, i.e. learning something new about your data.

There are two common ways of evaluating clusterings automatically:

  • Internal cohesion. There is some particular property, such as in-cluster variance compared to between-cluster variance, to minimize. The problem is that it is usually fairly trivial to cheat, i.e. to construct a trivial solution that scores really well. So this method must not be used to compare methods based on different assumptions. You can't even fairly compare different types of linkage for hierarchical clustering. (A silhouette-style sketch follows this list.)

  • External evaluation. You use a labeled data set and score algorithms by how well they rediscover existing knowledge. Sometimes this works quite well, so it is an accepted state of the art for evaluation. Yet any supervised or semi-supervised method will of course score much better on this. As such, it is (a) biased towards supervised methods, and (b) actually goes completely against the knowledge-discovery idea of finding something you did not yet know. (An adjusted Rand index sketch follows this list.)
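
To make the two options concrete (not to endorse them), here is a short sketch that reuses D and labels from the snippet under the question. The silhouette score stands in for an internal cohesion measure and the adjusted Rand index for an external one; true_labels is a hypothetical placeholder, since real ground truth is exactly what unsupervised clustering usually lacks:

    import numpy as np
    from sklearn.metrics import silhouette_score, adjusted_rand_score

    # Internal cohesion: silhouette over the precomputed distance matrix D.
    # Higher is better, but the score rewards the assumptions baked into the
    # distances, so it cannot fairly compare methods with different assumptions.
    sil = silhouette_score(D, labels, metric="precomputed")

    # External evaluation: chance-corrected agreement with known labels.
    # true_labels is a placeholder; substitute real labels if you have any.
    true_labels = np.random.RandomState(1).randint(0, 3, size=len(labels))
    ari = adjusted_rand_score(true_labels, labels)

    print(f"silhouette: {sil:.3f}  adjusted Rand: {ari:.3f}")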

If you really mean to use clustering - i.e. learn something about your data - you will at some point have to inspect the clusters, preferably by completely independent means such as a domain expert. If the expert can tell you that e.g. the user group identified by the clustering is a non-trivial group not yet investigated closely, then you are a winner.

However, most people want to have a "one click" (and one-score) evaluation, unfortunately.

Oh, and "clustering" is not really a machine learning task. There actually is no learning involved. To the machine learning community, it is the ugly duckling that nobody cares about.

Has QUIT--Anony-Mousse answered Sep 18 '22