I'm trying to determine the right number of clusters for clustering some data.
I know that this is possible using the Davies–Bouldin Index (DBI).
To use DBI, you compute it for each candidate number of clusters, and the number that minimizes the DBI is taken as the right number of clusters.
The question is:
How can I tell whether 2 clusters are better than 1 cluster using DBI? In other words, how can I compute DBI when I have just 1 cluster?
Considering only the average DBI of all clusters is apparently not a good idea.
Increasing the number of clusters k without any penalty will tend to reduce the DBI of the resulting clustering, down to the extreme case of zero DBI when each data point is its own cluster (because each data point coincides with its own centroid).
So it's hard to say which clustering is better if you use only the average DBI as the performance metric.
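To make this concrete, here is a minimal pure-Python sketch of the DBI, assuming Euclidean distance, centroid-based scatter, and labels that are already given. The function names and the toy points are my own, not from any library. It also shows why the 1-cluster case has no value: DBI compares each cluster with its most similar *other* cluster, so a single cluster has nothing to be compared against.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def davies_bouldin(points, labels):
    """Davies-Bouldin index for a given labeling (lower means better separated).

    Assumes Euclidean distance and distinct cluster centroids
    (coinciding centroids would cause a division by zero).
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        # With one cluster there is no "other" cluster to compare with.
        raise ValueError("DBI is undefined for fewer than 2 clusters")

    cents = {lab: centroid(ps) for lab, ps in clusters.items()}
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = {
        lab: sum(math.dist(p, cents[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity of cluster i to any other cluster j.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Hypothetical toy data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

dbi_two_clusters = davies_bouldin(points, [0, 0, 0, 1, 1, 1])
# Every point is its own cluster: all scatters are zero, so DBI is zero.
dbi_singletons = davies_bouldin(points, list(range(len(points))))
```

On this data the singleton labeling scores lower than the natural two-group labeling, even though it tells you nothing about the data. That is the problem with picking k purely by minimum DBI.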
A good practical method is to use the elbow method.
This method looks at the percentage of variance explained as a function of the number of clusters: you should choose a number of clusters such that adding another cluster doesn't give a much better model of the data. More precisely, if you plot the percentage of variance explained by the clusters against the number of clusters, the first clusters will add a lot of information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion".
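The elbow idea can be sketched like this. It is only an illustration, under my own assumptions: a tiny pure-Python k-means with a deterministic farthest-point initialization, and "variance explained" taken as 1 minus the within-cluster sum of squares divided by the total sum of squares. The helper names and the toy data are hypothetical.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def sum_sq(points, center):
    """Sum of squared Euclidean distances from points to center."""
    return sum(math.dist(p, center) ** 2 for p in points)

def farthest_point_init(points, k):
    """Deterministic init: start at the first point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd iterations; returns the final groups of points."""
    centers = farthest_point_init(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        new_centers = [
            centroid(g) if g else centers[i] for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return groups

def explained_variance(points, k):
    """Fraction of the total sum of squares explained by k clusters."""
    total = sum_sq(points, centroid(points))
    within = sum(sum_sq(g, centroid(g)) for g in kmeans(points, k) if g)
    return 1.0 - within / total

# Hypothetical toy data: three compact, well-separated blobs.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 10), (5, 11), (6, 10), (6, 11),
]

curve = {k: explained_variance(points, k) for k in range(1, 7)}
gains = {k: curve[k] - curve[k - 1] for k in range(2, 7)}
for k in sorted(curve):
    print(k, round(curve[k], 3))
```

Note that k = 1 is perfectly well defined here (it explains none of the variance), which is what makes this criterion usable for comparing 1 cluster against 2. On this data the gain from going to 3 clusters is large and the gain after that is small, so the elbow sits at k = 3.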
Some other good alternatives for choosing the optimal number of clusters: