I am using the fpc package in R to perform cluster validation.
I can use the function cluster.stats() to compare my clustering with an external partitioning and compute several metrics such as the Rand index, entropy, etc.
However, I am looking for a metric called 'purity' or 'cluster accuracy', which is defined at http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
Is there an implementation of this measure in R?
thanks, Chet
I don't know of an off-the-shelf function, but here is one way you could do it yourself using the equation in your link:
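ClusterPurity <- function(clusters, classes) {
  # Cross-tabulate classes (rows) against clusters (columns), take the
  # largest class count within each cluster, and divide by the total.
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}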
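For reference, this is the equation from the linked page as I read it, with N the number of points, ω_k the k-th cluster and c_j the j-th class. Each column of table(classes, clusters) holds the counts |ω_k ∩ c_j| for one cluster, which is why the function takes the maximum over columns:

```latex
\mathrm{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} \left| \omega_k \cap c_j \right|
```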
Here we can test it on some random assignments with equally likely classes, where I believe we expect the purity to be roughly 1/number-of-classes (about 0.333 for three classes):
> n = 1e6
> classes = sample(3, n, replace=TRUE)
> clusters = sample(5, n, replace=TRUE)
> ClusterPurity(clusters, classes)
[1] 0.334349
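As a deterministic check, here is a small sketch that rebuilds the 17-point example I recall from the linked page (three clusters whose majority classes have 5, 4 and 3 members), for which purity should be 12/17, or about 0.71. The labels "x", "o" and "d" and the exact per-cluster counts are my own encoding of that figure, so treat them as an assumption rather than a quote from the book:

```r
ClusterPurity <- function(clusters, classes) {
  sum(apply(table(classes, clusters), 2, max)) / length(clusters)
}

# Cluster sizes 6, 6 and 5 (assumed from the linked example).
clusters <- rep(1:3, times = c(6, 6, 5))

# Class labels per cluster (hypothetical encoding: "d" stands for diamond).
classes <- c(rep(c("x", "o"), times = c(5, 1)),          # cluster 1
             rep(c("x", "o", "d"), times = c(1, 4, 1)),  # cluster 2
             rep(c("x", "d"), times = c(2, 3)))          # cluster 3

purity <- ClusterPurity(clusters, classes)
print(purity)  # (5 + 4 + 3) / 17
```

Unlike the random test, this one needs no seed, so it gives the same value on every run.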