How to generate performance stats of clustering from flexclust?

After trying a few clustering algorithms, I got the best performance on my dataset using flexclust::kcca with family = kccaFamily("angle").

Here's an example using the Nclus dataset from flexclust.

library(fpc)
library(flexclust)
data(Nclus)

k <- 4
family <- flexclust::kccaFamily("angle")
model <- flexclust::kcca(Nclus, k, family)

Now I want to optimise the number of clusters. The most comprehensive set of performance metrics for cluster models seems to be found using fpc::cluster.stats. This function needs two inputs: a distance matrix, and the clusters that were assigned.

The latter is easy; it is just model@cluster.

I'm struggling with what to provide for the distance matrix. kcca doesn't return a distance object (or if it does, I haven't found it).

I guess that typically I would use dist(Nclus). In this case, I don't want the Euclidean distance (or any of the other methods available to dist) – I want the distance measure used by the clustering algorithm. I can recreate the distance matrix used by kcca using the code from that function.

control <- as(list(), "flexclustControl")
centers <- flexclust:::initCenters(Nclus, k, family, control)
distmat <- distAngle(Nclus, centers$centers)

Then I should just be able to calculate the cluster model stats using

fpc::cluster.stats(distmat, model@cluster)

The trouble is that this gives me two warnings about the distance argument.

Warning messages:
1: In as.dist.default(d) : non-square matrix
2: In as.matrix.dist(d) :
  number of items to replace is not a multiple of replacement length

That makes me suspect I'm giving it the wrong thing.

How should I pass the distance matrix to cluster.stats?

asked Aug 03 '16 06:08 by Richie Cotton


1 Answer

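I guess you should use the pairwise distances between the observations themselves, not the distances from the observations to the centroids, so that the matrix is square: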

distmat <- distAngle(Nclus, Nclus)
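
To illustrate why the shape matters, here is a minimal base R sketch on synthetic data. The helper `cosine_dissimilarity` is hypothetical and is not part of flexclust; it is an assumption about what an angle-based dissimilarity looks like (1 minus the cosine of the angle) and may not match `distAngle` exactly. It only shows that observation-to-centroid distances give an n x k matrix, while observation-to-observation distances give the square n x n matrix that `as.dist` expects.

```r
# Hypothetical helper (not part of flexclust): 1 - cosine of the angle
# between every row of `x` and every row of `y`.
cosine_dissimilarity <- function(x, y = x) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  ux <- x / sqrt(rowSums(x^2))   # scale each row of x to unit length
  uy <- y / sqrt(rowSums(y^2))   # scale each row of y to unit length
  d <- 1 - tcrossprod(ux, uy)    # 1 - cosine similarity for every pair
  d[d < 0] <- 0                  # clamp tiny negative rounding errors
  d
}

set.seed(1)
x <- matrix(rnorm(40), ncol = 2)   # 20 synthetic observations
centers <- x[1:4, ]                # 4 made-up "centroids" for illustration

to_centers <- cosine_dissimilarity(x, centers)  # 20 x 4: not a distance matrix
pairwise <- cosine_dissimilarity(x)             # 20 x 20: one row/column per observation
diag(pairwise) <- 0                             # self-dissimilarity is exactly zero

dim(to_centers)
dim(pairwise)

# A square, symmetric matrix converts cleanly to a "dist" object
d <- as.dist(pairwise)
```

Passing something like `to_centers` to `as.dist` is what would trigger the "non-square matrix" warning, whereas `pairwise` has the shape a function taking a distance matrix can work with.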

However, I am not sure that this makes sense from a modelling viewpoint: to assess the quality of your clustering output you should use whichever metric best suits your specific use case. This might (or might not) be the L1 distance:

distmat <- dist(Nclus, "manhattan")

In particular, I'd guess that minimising the "angle between observation and centroid / standardized mean" is not the same as minimising the intra-cluster angle between the observations; also I'd guess that the latter quantity is not particularly useful for modelling purposes. In other words, I'd regard the distAngle as an alternative way ("trick") to identify the k clusters, but I would then evaluate the identified clusters by other metrics. Hope this makes any sense to you...
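
As a rough sketch of that idea (cluster with one method, evaluate with a separately chosen metric, and compare several values of k), the following uses only base R plus the recommended `cluster` package. Everything in it is an assumption made for illustration: the data are synthetic, `kmeans()` is a stand-in for `flexclust::kcca()`, and the average silhouette width stands in for whichever statistic from `fpc::cluster.stats` you actually care about.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

set.seed(42)
# Synthetic stand-in data: three 2-D blobs of 30 points each
x <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.3), rnorm(30, mean = 3, sd = 0.3))
)

# Evaluation metric chosen independently of the clustering algorithm
eval_dist <- dist(x, method = "manhattan")

ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  # kmeans() is only a stand-in here for flexclust::kcca()
  cl <- kmeans(x, centers = k, nstart = 10)$cluster
  # Average silhouette width of this clustering under the evaluation metric
  mean(silhouette(cl, eval_dist)[, "sil_width"])
})
names(avg_sil) <- ks

# Candidate number of clusters: the k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
```

With the real data you would swap in the cluster assignments from `kcca` and the statistic of your choice; the structure of the loop stays the same.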

answered Oct 20 '22 01:10 by renato vitolo