 

Selecting an appropriate similarity metric & assessing the validity of a k-means clustering model

I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.

I need to know

  1. How would I determine whether my algorithm works correctly? I can't produce a graph that would give some idea of its correctness.
  2. Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
asked Dec 10 '22 by user350556

2 Answers

The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.

How would I determine if my [clustering] algorithm works correctly?

k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"

Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:

inter-centroidal separation / intra-cluster variance

As the value of this ratio increases, the quality of your clustering result increases.

This is intuitive. The first of these metrics just answers the question: how far apart is each cluster from the others (measured by the cluster centers)?

But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.

In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or to compare the results from the same algorithm run under different variable parameters--e.g., number of iterations, choice of distance metric, number of centroids (value of k).

The desired result is tight (small) clusters, each one far away from the others.

The calculation is simple:

For inter-centroidal separation:

  • calculate the pair-wise distance between cluster centers; then

  • calculate the median of those distances.

For intra-cluster variance:

  • for each cluster, calculate the distance of every data point in a given cluster from its cluster center; next

  • (for each cluster) calculate the variance of the sequence of distances from the step above; then

  • average these variance values.
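The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not code from the answer; the function name and argument layout are my own.

```python
import numpy as np

def clustering_quality(X, labels, centers):
    """Ratio of inter-centroidal separation to intra-cluster variance.

    Higher values indicate tighter, better-separated clusters.
    X:       (n_samples, n_features) data matrix
    labels:  cluster index per sample
    centers: (k, n_features) cluster centers
    """
    k = len(centers)

    # Inter-centroidal separation: median of pairwise center distances.
    pair_dists = [np.linalg.norm(centers[i] - centers[j])
                  for i in range(k) for j in range(i + 1, k)]
    separation = np.median(pair_dists)

    # Intra-cluster variance: variance of point-to-center distances,
    # averaged across clusters.
    variances = []
    for c in range(k):
        dists = np.linalg.norm(X[labels == c] - centers[c], axis=1)
        variances.append(dists.var())

    return separation / np.mean(variances)
```

Because the ratio is unit-dependent, it is best used comparatively: run it on two candidate clusterings of the same data and prefer the one with the larger value.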


That's my answer to the first question. Here's the second question:

Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?

First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?

Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:

  • subtract their feature vectors element-wise,

  • square each item in that result vector,

  • sum that result,

  • take the square root of that scalar.

Nowhere in this sequence of calculations is scale implicated.
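Those four steps translate directly into code. A minimal sketch (the function name is my own):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors of any length."""
    diff = np.asarray(a) - np.asarray(b)  # element-wise subtraction
    return np.sqrt(np.sum(diff ** 2))     # square each item, sum, square root
```

The same function handles 2 dimensions or 100 without modification, which is the point about scalability above.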

But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.
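To make that caveat concrete, here's a small sketch (the one-hot encoding and city columns are invented for illustration) showing how a Euclidean-style mean of a categorical feature produces a "location" no user actually has:

```python
import numpy as np

# One-hot encode a hypothetical "current location" feature:
# column 0 = San Francisco, column 1 = Boston.
users = np.array([[1, 0]] * 100 + [[0, 1]] * 100)

# The Euclidean centroid of this feature is [0.5, 0.5] -- a fractional
# "city" that corresponds to no real user, the geometric analogue of
# placing everyone somewhere in Kansas.
centroid = users.mean(axis=0)
```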

In any event, since we don't know anything about your data, I'll just give you a simple flow diagram that you can apply to it to identify an appropriate similarity metric.

To identify an appropriate similarity metric given your data:

(flow diagram image not reproduced here)

answered Jan 18 '23 by doug

  1. Euclidean distance is good when dimensions are comparable and on the same scale. If one dimension represents length and another the item's weight, plain Euclidean distance should be replaced with a weighted (scaled) version.
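One common way to get that weighting (a sketch of my own, not spelled out in the answer) is to standardize each dimension to zero mean and unit variance first, so that Euclidean distance on the scaled data weights every dimension equally:

```python
import numpy as np

def standardize(X):
    """Scale each column of X to zero mean and unit variance so that
    dimensions with large units (e.g. millimetres vs. kilograms)
    don't dominate the Euclidean distance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```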

  2. Project the data to 2-D and plot it; this is a good way to see visually whether the clustering works. Alternatively, use a sanity check: find the cluster centers and verify that no item in a cluster lies too far from its center.
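For 30-dimensional data, one way to get that 2-D picture (a sketch under my own assumptions; the answer doesn't name a method) is to project onto the first two principal components via an SVD:

```python
import numpy as np

def project_2d(X):
    """Project high-dimensional data onto its first two principal
    components (computed via SVD) so clusters can be inspected
    in a 2-D scatter plot."""
    Xc = X - X.mean(axis=0)                          # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                             # coordinates on PC1, PC2
```

Well-separated clusters in the original space usually remain visibly separated in this projection, which is what makes the plot a useful sanity check.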

answered Jan 18 '23 by Anton