After trying a few clustering algorithms, I got the best performance on my dataset using flexclust::kcca with family = kccaFamily("angle").
Here's an example using the Nclus dataset from flexclust.
library(fpc)
library(flexclust)
data(Nclus)
k <- 4
family <- flexclust::kccaFamily("angle")
model <- flexclust::kcca(Nclus, k, family)
Now I want to optimise the number of clusters. The most comprehensive set of performance metrics for cluster models seems to be found using fpc::cluster.stats. This function needs two inputs: a distance matrix, and the clusters that were assigned.
The latter is easy; it is just model@cluster.
I'm struggling with what to provide for the distance matrix. kcca doesn't return a distance object (or if it does, I haven't found it).
I guess that typically I would use dist(Nclus). In this case, I don't want the Euclidean distance (or any of the other methods available to dist); I want the distance measure used by the clustering algorithm. I can recreate the distance matrix used by kcca using the code from that function.
control <- as(list(), "flexclustControl")
centers <- flexclust:::initCenters(Nclus, k, family, control)
distmat <- distAngle(Nclus, centers$centers)
Then I should just be able to calculate the cluster model stats using
fpc::cluster.stats(distmat, model@cluster)
The trouble is that this gives me two warnings about the distance argument.
Warning messages:
1: In as.dist.default(d) : non-square matrix
2: In as.matrix.dist(d) :
number of items to replace is not a multiple of replacement length
That makes me suspect I'm giving it the wrong thing.
How should I pass the distance matrix to cluster.stats?
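cluster.stats expects a square matrix of pairwise distances between the observations, whereas distAngle(Nclus, centers$centers) returns the distance from each observation to each of the k centres, hence the "non-square matrix" warning. I guess you should use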
distmat <- distAngle(Nclus, Nclus)
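To see where the non-square warning comes from, it may help to compare the shapes of the two matrices. This is only a minimal sketch, assuming flexclust and fpc are installed. The first k rows of Nclus stand in for the centres purely for illustration (they are not what kcca would pick), and the names obs_to_centres, pairwise and stats are made up here.

```r
library(flexclust)
library(fpc)

data(Nclus)
k <- 4
set.seed(1)  # kcca starts from randomly chosen centres
model <- kcca(Nclus, k, family = kccaFamily("angle"))

# Observation-to-centre distances: one row per observation, one column per centre.
# The "centres" here are just the first k rows of Nclus (an illustrative assumption).
obs_to_centres <- distAngle(Nclus, Nclus[1:k, , drop = FALSE])
dim(obs_to_centres)

# Pairwise distances between observations: a square matrix.
pairwise <- distAngle(Nclus, Nclus)
dim(pairwise)

# as.dist() keeps the lower triangle, which is the form cluster.stats works with.
stats <- cluster.stats(as.dist(pairwise), model@cluster)
stats$avg.silwidth
```

The first matrix has as many columns as there are centres, so it cannot be treated as a distance object; the second has one row and one column per observation.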
However, I am not sure that this makes sense from a modelling viewpoint: to assess the quality of your clustering output you should use the metric that best suits your specific use case; this might (or might not) be the L1 distance:
distmat <- dist(Nclus, "manhattan")
In particular, I'd guess that minimising the "angle between observation and centroid / standardized mean" is not the same as minimising the intra-cluster angle between the observations; I'd also guess that the latter quantity is not particularly useful for modelling purposes. In other words, I'd regard distAngle as an alternative way (a "trick") to identify the k clusters, but I would then evaluate the identified clusters by other metrics. Hope this makes sense to you.
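One way to act on this when choosing the number of clusters might look like the sketch below: fit kcca with the angle family for several values of k, then score every fit against one fixed evaluation distance. The candidate range 2:6, the Manhattan distance and the use of average silhouette width are all assumptions for illustration, not recommendations, and eval_dist, candidate_k and avg_sil are made-up names.

```r
library(flexclust)
library(fpc)

data(Nclus)
set.seed(1)  # kcca starts from randomly chosen centres

# Evaluation distance, chosen independently of the clustering family (assumption).
eval_dist <- dist(Nclus, method = "manhattan")

# Candidate numbers of clusters (assumption).
candidate_k <- 2:6

# Cluster with the angle family, then score each fit with the same evaluation distance.
avg_sil <- sapply(candidate_k, function(k) {
  fit <- kcca(Nclus, k, family = kccaFamily("angle"))
  cluster.stats(eval_dist, fit@cluster)$avg.silwidth
})
names(avg_sil) <- candidate_k
avg_sil
```

Because every k is scored with the same distance, the values are comparable across fits; swapping in a different evaluation distance only means changing eval_dist.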