
R Clustering 'purity' metric

I am using the fpc package in R to perform cluster validation.

I can use the function cluster.stats() to compare my clustering with an external partitioning and compute several metrics, such as the Rand index, entropy, etc.

However, I am looking for a metric called 'purity' or 'cluster accuracy', which is defined at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html

I am wondering if there is an implementation of this measure in R.

Thanks, Chet

chet asked Feb 12 '12 23:02


1 Answer

I don't know of an off-the-shelf function, but here is one way you could do it yourself using the equation in your link:

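ClusterPurity <- function(clusters, classes) {
  # Cross-tabulate classes (rows) against clusters (columns), take the
  # size of the majority class in each cluster, and divide by the total.
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}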
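As a quick sanity check, here is a small sketch that applies the function to the 17-point example from the linked IR-book page (three clusters containing 5 x + 1 o, 1 x + 4 o + 1 diamond, and 2 x + 3 diamond). The label vectors and the short class name "d" for the diamond are just my own encoding of that figure; the majority counts are 5, 4 and 3, so the purity should come out as 12/17.

```r
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

# Cluster assignment of each of the 17 points (6 + 6 + 5)
clusters <- c(rep(1, 6), rep(2, 6), rep(3, 5))

# True class of each point, listed cluster by cluster
classes <- c(rep("x", 5), "o",            # cluster 1
             "x", rep("o", 4), "d",       # cluster 2
             rep("x", 2), rep("d", 3))    # cluster 3

# Contingency table: rows are classes, columns are clusters
print(table(classes, clusters))

purity <- ClusterPurity(clusters, classes)
print(purity)  # 12/17, about 0.71
```

The argument order matters only in that the clusters must end up as the columns of the table, since the maximum is taken per cluster.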

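Here we can test it on some random assignments. Since the clusters are independent of the classes, I believe we expect the purity to be about the share of the largest class, which is 1/number-of-classes when the classes are equally likely: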

> n <- 1e6
> classes <- sample(3, n, replace=TRUE)
> clusters <- sample(5, n, replace=TRUE)
> ClusterPurity(clusters, classes)
[1] 0.334349
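To illustrate why the random baseline is really the share of the largest class rather than always 1/number-of-classes, here is a rough sketch with unbalanced classes. The 60/30/10 class probabilities, the seed, and the sample size are arbitrary choices of mine for illustration. It also shows two edge cases: a clustering that matches the classes exactly scores 1, and so does putting every point in its own cluster, which is why purity alone tends to favour large numbers of clusters.

```r
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

set.seed(42)  # arbitrary seed, only for reproducibility
n <- 1e5

# Unbalanced classes: class 1 makes up roughly 60% of the points
classes  <- sample(3, n, replace = TRUE, prob = c(0.6, 0.3, 0.1))
clusters <- sample(5, n, replace = TRUE)   # random, unrelated to classes

random_purity <- ClusterPurity(clusters, classes)
print(random_purity)   # close to 0.6, the share of the largest class

# A clustering identical to the classes is perfectly pure
perfect_purity <- ClusterPurity(classes, classes)
print(perfect_purity)

# One cluster per point is also "perfectly pure" (shown on a small subset)
small <- classes[1:100]
singleton_purity <- ClusterPurity(seq_along(small), small)
print(singleton_purity)
```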

John Colby answered Nov 15 '22 00:11